What is ProteoArk?


ProteoArk is a free web-based tool for proteomic data analysis. It accepts protein files from Proteome Discoverer and MaxQuant. ProteoArk provides four sections: Labelled, Label-free, Differential expression, and Data visualization. Users select the labelled or label-free section based on the type of their analysis: if the samples are SILAC or TMT labelled, choose the labelled section; if the samples are label-free, choose the label-free section. Both sections include normalization, p-value and fold-change calculations, gene enrichment analysis, and data visualization.

The differential expression analysis section accepts normalized protein data. It includes p-value and fold-change calculation methods and various visualization methods, but no normalization. That is, the user can input normalized data, or data (proteomics/transcriptomics) that does not need normalization, for differential expression analysis.

The fourth section is only for data visualization and provides 15 different types of plots for analysis.

* For a video tutorial, please follow this link

Labelled data


In the labelled samples section, the analysis pipeline differs for technical and biological replicates. The analysis of biological replicates includes batch correction, which is not included in the analysis of technical replicates.

Upload file

Step 1: Select the input data type (Technical or biological replicate).

Step 2: Provide the details on the number of test samples, control, and biological replicates of user data (Exclude the number of replicates).

Step 3: Upload the input file in any of the given formats (.csv, .tsv and .xlsx) and click submit.


Step 4: Drag and drop each of the replicates in the corresponding test/control sample box.

    • If the file contains three biological replicates, drag and drop the first, second, and third labelled biological replicates of the samples into the corresponding containers in the test sample section. Do the same for the control samples.
    • Click the Next button after grouping the samples into batches.

  • Normalization

    Step 5: Users can choose normalization methods Sum, Quantile, Internal reference scaling (IRS) normalization, Z-score and Trimmed mean of M-values (TMM).

    • Users can remove contaminants by selecting ‘remove contaminants’ and can download the list of contaminants to view it.
    • Missing values in the input file can be replaced with any value; the default is zero.
    • If the file does not contain a gene symbol column, ProteoArk can convert protein accessions to gene symbols for further analysis (accessions that cannot be converted remain as protein accessions).
    • Click Next.
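As an illustration of the simplest option in Step 5, sum normalization can be sketched as follows. This is a minimal sketch and not ProteoArk's actual implementation: it assumes each sample is rescaled so that its total abundance equals the mean total across samples, and the function name `sum_normalize` and the toy data are hypothetical.

```python
def sum_normalize(samples):
    """Rescale each sample so its total abundance equals the mean total.

    `samples` maps a sample name to a list of protein abundances
    (same protein order in every sample). Assumed scheme, for illustration.
    """
    totals = {name: sum(values) for name, values in samples.items()}
    target = sum(totals.values()) / len(totals)  # mean of the sample totals
    return {
        name: [v * target / totals[name] for v in values]
        for name, values in samples.items()
    }


# Hypothetical toy data: the control was loaded at twice the amount.
data = {"test_1": [10.0, 30.0, 60.0], "control_1": [20.0, 60.0, 120.0]}
normalized = sum_normalize(data)
# After normalization every sample has the same total abundance.
```

Other choices of target (for example the median total) are equally plausible; the point is only that a single scaling factor per sample removes loading differences.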

    Step 6: Users can rename the samples and click submit (Note: renaming is not mandatory, but the names provided in this section will be used as sample names in further analysis).


    Step 7: Based on the chosen normalization method, the data will be normalized and represented using PCA and a box plot. If the experiment uses biological replicates, the user has the option to perform batch correction. ProteoArk provides two methods for batch correction (ComBat and limma). Batch-corrected data will be represented using a box plot.

    • Because of errors in mass spectrometry, some proteins may show an abundance value in only one replicate (and not in the other two, in the case of triplicates). ProteoArk treats this as an error and removes these proteins from the analysis. Users can download the removed proteins from the result page.
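The replicate filter described above can be sketched as follows. This is an assumption-laden illustration, not ProteoArk's code: missing values are represented as `None`, the threshold `min_replicates=2` is inferred from the description, and the accessions are made up.

```python
def split_by_replicate_coverage(abundances, min_replicates=2):
    """Separate proteins quantified in too few replicates.

    `abundances` maps a protein accession to its replicate values, with
    None marking a missing value. `min_replicates=2` is an assumed default.
    """
    kept, removed = {}, {}
    for protein, values in abundances.items():
        observed = sum(1 for v in values if v is not None)
        target = kept if observed >= min_replicates else removed
        target[protein] = values
    return kept, removed


# Hypothetical triplicate data.
triplicates = {
    "P12345": [1520.0, 1480.0, 1610.0],
    "Q67890": [None, 930.0, None],  # observed in one replicate only
}
kept, removed = split_by_replicate_coverage(triplicates)
```

Returning the removed proteins alongside the kept ones mirrors the option to download the removed list.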

  • Differential expression analysis

    Step 8: In differential expression analysis, the p-value and fold change can be calculated in either of two ways:

    • Using limma (an R package), which ProteoArk runs through Docker.

    Or

    • By choosing separate methods for the p-value (Welch’s t-test, two-sample t-test, one-way ANOVA, or two-way ANOVA) and for the fold-change calculation.
    • The user can set the significance cut-offs for fold change and p-value, which are used to identify the differentially expressed (DE) proteins in the analysis.
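To make the second route concrete, the sketch below computes a log2 fold change and Welch's t statistic for one protein using only the Python standard library. It is illustrative only: the fold change is assumed to be the ratio of mean abundances, the function names are hypothetical, and the final step of converting the statistic to a p-value (which needs the Student t distribution with the returned degrees of freedom) is left to a statistics package.

```python
import math
import statistics


def log2_fold_change(test, control):
    """log2 of the ratio of mean test abundance to mean control abundance."""
    return math.log2(statistics.mean(test) / statistics.mean(control))


def welch_t(test, control):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(test), len(control)
    v1, v2 = statistics.variance(test), statistics.variance(control)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (statistics.mean(test) - statistics.mean(control)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Hypothetical replicate abundances for a single protein.
test = [8.0, 9.0, 10.0]
control = [4.0, 4.5, 5.0]
fc = log2_fold_change(test, control)
t, df = welch_t(test, control)
```

Unlike the two-sample t-test, Welch's variant does not assume equal variances in the two groups, which is why the degrees of freedom are generally not an integer.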

    • The result page provides a bar chart representing the upregulated and downregulated proteins in every sample.

    • Users can download the full result, which includes the normalization and differential expression calculations, and, separately, a file containing only the differentially expressed proteins (in .csv format).
    • Based on the chosen normalization method, volcano plots are generated for each test-control comparison in the analysis. The significance cut-offs used for the volcano plot are those given by the user on the previous page, but the user can also reset them on the result page. Upregulated and downregulated proteins are shown in red and green, respectively. Insignificant proteins are shown in black.
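The colour assignment in the volcano plot can be expressed as a small rule. The sketch below is a hedged illustration: the default cut-offs (2-fold change, p < 0.05), the use of a strict inequality for the p-value, and the function names are assumptions, not values taken from ProteoArk.

```python
import math


def classify(log2_fc, p_value, fc_cutoff=2.0, p_cutoff=0.05):
    """Return the volcano-plot colour class for one protein.

    `fc_cutoff` is a linear fold change; both defaults are assumed values.
    """
    if p_value < p_cutoff and abs(log2_fc) >= math.log2(fc_cutoff):
        return "red" if log2_fc > 0 else "green"  # up- / downregulated
    return "black"  # not significant


def volcano_point(log2_fc, p_value):
    """x/y coordinates conventionally used for a volcano plot."""
    return log2_fc, -math.log10(p_value)


colours = [
    classify(1.5, 0.01),   # strongly up, significant
    classify(-1.5, 0.01),  # strongly down, significant
    classify(1.5, 0.20),   # large change but not significant
    classify(0.3, 0.001),  # significant but small change
]
```

Resetting the cut-offs on the result page corresponds to calling `classify` again with different `fc_cutoff` and `p_cutoff` values; the underlying fold changes and p-values do not change.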

    • The heatmap represents the fold change of statistically significant proteins that are differentially expressed in one or more samples.
    • ProteoArk provides two methods for clustering: hierarchical clustering and k-means. For k-means, the number of clusters in the dataset (k) must be determined in advance, which can be done using the elbow method provided on the same page. The elbow method computes the WCSS (within-cluster sum of squares), i.e., the sum of the squared distances between the points in a cluster and the cluster centroid.
    • The value of k is the point where the elbow forms (the elbow point). K-means is not available if the number of differentially expressed proteins is less than 8 (Note: 8 is the total number of DE proteins across all datasets).

    • For hierarchical clustering, users need not specify the number of clusters.
    • The color bar of the heatmap is customized so that upregulated proteins are shown in red and downregulated proteins in green. If a protein is not differentially expressed in a sample, it is shown in black.
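The WCSS quantity behind the elbow method can be computed directly from its definition. In the sketch below the cluster assignments are fixed by hand rather than produced by k-means, and the fold-change profiles are invented; it only shows how WCSS drops sharply once k matches the structure of the data.

```python
def wcss(points, labels, centroids):
    """Within-cluster sum of squares for a given cluster assignment."""
    total = 0.0
    for point, label in zip(points, labels):
        centre = centroids[label]
        total += sum((x - c) ** 2 for x, c in zip(point, centre))
    return total


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(axis) / len(points) for axis in zip(*points)]


# Toy fold-change profiles (two samples per protein) forming two clear groups.
profiles = [[1.0, 1.2], [1.2, 1.0], [-1.0, -1.2], [-1.2, -1.0]]

# k = 1: every protein in one cluster.
wcss_k1 = wcss(profiles, [0, 0, 0, 0], [centroid(profiles)])

# k = 2: upregulated and downregulated proteins in separate clusters.
wcss_k2 = wcss(
    profiles, [0, 0, 1, 1], [centroid(profiles[:2]), centroid(profiles[2:])]
)
```

Plotting WCSS against k and looking for the point where further increases in k stop giving large reductions yields the elbow point.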

  • Functional enrichment analysis, KEGG pathway visualization and STRING analysis

    Step 9: Users can perform functional enrichment analysis on the differentially expressed proteins in ProteoArk, selecting the p-value threshold and the species name. To obtain results without any filter, provide ‘1’ as the p-value cut-off. Gene ontology results are represented using a circular bar plot, which can be downloaded in SVG format. Each bar in the plot is labelled with its GO number. Users can also download the entire result in .csv format.


    • Users can view all the KEGG pathways associated with the differentially expressed genes from the provided list. The genes are highlighted in red (upregulated) or green (downregulated) in the corresponding pathway map.
    • Users can download pathway images in PNG format.

    • Using the STRING API, ProteoArk displays the protein-protein interactions for the differentially expressed genes. The STRING interactions can be downloaded as an image and the resulting data in CSV format.
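For readers who want to reproduce the interaction lookup outside ProteoArk, the sketch below builds a request URL for the public STRING API `network` method. How ProteoArk itself calls STRING is not documented here, so this is an assumption; the gene list, the default species (9606, human) and the `caller_identity` value are placeholders. No network request is made.

```python
from urllib.parse import urlencode


def string_network_url(genes, species=9606, output_format="tsv"):
    """Build a STRING 'network' request URL for a list of gene symbols.

    `species` is an NCBI taxonomy ID (9606 = human, assumed default).
    """
    params = {
        # STRING expects multiple identifiers separated by a carriage return.
        "identifiers": "\r".join(genes),
        "species": species,
        "caller_identity": "example_script",  # hypothetical client name
    }
    return f"https://string-db.org/api/{output_format}/network?{urlencode(params)}"


url = string_network_url(["TP53", "MDM2", "CDKN1A"])
```

Requesting the `image` output format instead of `tsv` returns the network picture rather than the interaction table.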

Label-free data


    If the user has peptide intensities and wants to calculate protein abundance, they can choose the corresponding option; if not, they will be redirected to the protein grouping page, which works the same way as the labelled data analysis.


    Abundance calculation

    Step 1: Choose the method for quantification, iBAQ or TOP3.

    • Enter the enzyme used for protein digestion during sample preparation.
    • Upload the reference sequence file that was used for the PD or MaxQuant database search, and select the source of the file: select NCBI if the FASTA file is from the RefSeq database, or UniProt if it was downloaded from UniProt.

    • Choose the file for analysis. The file should contain the peptide sequence, intensity, and protein accession (RefSeq protein accession, gene symbol, or UniProt accession).
    • Provide the minimum and maximum peptide length and the number of missed cleavages, matching those used in the PD/MaxQuant search.
    • Click submit.

    • Only one dataset can be quantified at a time. Users can repeat this process for multiple datasets and combine the quantified files for further analysis.
    • After quantifying and compiling the datasets, choose ‘NO’ in the “Required protein quantification” box; this leads to the analysis page. Further analysis is the same as for the labelled samples.
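The two quantification methods can be sketched from their usual definitions: iBAQ divides the summed peptide intensity of a protein by its number of theoretically observable peptides, and TOP3 averages the three most intense peptides. The code below is a simplified illustration, not ProteoArk's implementation: it assumes trypsin (cleavage after K or R, not before P), generates no missed cleavages, uses assumed length limits of 6 to 30 residues, and runs on a made-up sequence.

```python
import re


def tryptic_peptides(sequence, min_len=6, max_len=30):
    """In-silico trypsin digest: cleave after K or R, but not before P.

    No missed cleavages are generated; the length limits are assumed defaults.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def ibaq(peptide_intensities, sequence, **digest_options):
    """Summed peptide intensity divided by the number of theoretical peptides."""
    n_theoretical = len(tryptic_peptides(sequence, **digest_options))
    return sum(peptide_intensities) / n_theoretical if n_theoretical else 0.0


def top3(peptide_intensities):
    """Mean intensity of the (up to) three most intense peptides."""
    top = sorted(peptide_intensities, reverse=True)[:3]
    return sum(top) / len(top) if top else 0.0


# Hypothetical protein sequence and peptide intensities.
sequence = "AAAAAAKCCCCCCCRPDDDDDDKEE"
peptides = tryptic_peptides(sequence)
ibaq_value = ibaq([100.0, 300.0], sequence)
top3_value = top3([5.0, 1.0, 9.0, 7.0])
```

This is why the enzyme, the peptide length range and the reference FASTA file are requested: they determine the number of theoretical peptides in the iBAQ denominator.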
  • Upload file

    Step 2: Select the input data type (Technical or biological replicate)

    Step 3: Provide the details on the number of test samples, control, and biological replicates of user data (Exclude the number of replicates)

    Step 4: Upload the input file in any of the given formats (.csv, .tsv, and .xlsx) and click submit


    Step 5: Drag and drop each of the replicates into the corresponding containers. (The columns that contain abundance or intensity values are shown on the left side of this page. If the input file has no abundance/intensity columns, all the columns are displayed on this page.)

    • If the file contains three biological replicates, drag and drop the first, second, and third labelled biological replicates of the samples into the first, second, and third sample boxes, respectively, in the test sample section. Do the same for the control samples (the replicates in each box will be treated as one batch).
    • If the file contains technical replicates, drag and drop all the replicates of each test and control sample into the corresponding box.
    • Click the Next button after grouping the samples.

  • Normalization

    Step 6: Users can choose normalization methods Sum, Quantile, Internal reference scaling (IRS) normalization, Z-score and Trimmed mean of M-values (TMM).

    • Users can remove contaminants by selecting ‘remove contaminants’ and can download the list of contaminants to view it.
    • Missing values in the input file can be replaced with any value; the default is zero.
    • If the file does not contain a gene symbol column, ProteoArk can convert protein accessions to gene symbols for further analysis (accessions that cannot be converted remain as protein accessions).
    • Click Next.
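Two of the options above, missing-value replacement and Z-score normalization, are simple enough to sketch in a few lines. This is an illustration under assumptions: missing values are represented as `None`, the Z-score is computed over the values of one sample using the sample standard deviation, and the function names are hypothetical.

```python
import statistics


def fill_missing(values, fill_value=0.0):
    """Replace missing entries (None) with a fixed value; zero is the default."""
    return [fill_value if v is None else v for v in values]


def z_score(values):
    """Centre values on their mean and scale by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical abundances for one sample, with one missing value.
sample = fill_missing([2.0, None, 4.0, 6.0])
scaled = z_score(sample)
```

Note that the replacement value affects the result: filling with zero pulls the mean down and widens the spread, so the choice of fill value should be made with the normalization method in mind.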

    Step 7: Users can rename the samples and click submit (Note: renaming is not mandatory, but the names provided in this section will be used as sample names in further analysis).


    Step 8: Based on the chosen normalization method, the data will be normalized and represented using PCA and a box plot. If the experiment uses biological replicates, the user has the option to perform batch correction. ProteoArk provides two methods for batch correction (ComBat and limma). Batch-corrected data will be represented using a box plot.

    • Because of errors in mass spectrometry, some proteins may show an abundance value in only one replicate (and not in the other two, in the case of triplicates). ProteoArk treats this as an error and removes these proteins from the analysis. Users can download the removed proteins from the result page.

  • Differential expression analysis

    Step 9: In differential expression analysis, the p-value and fold change can be calculated in either of two ways:

    • Using limma (an R package), which ProteoArk runs through Docker.

    Or

    • By choosing separate methods for the p-value (Welch’s t-test, two-sample t-test, one-way ANOVA, or two-way ANOVA) and for the fold-change calculation.
    • The user can set the significance cut-offs for fold change and p-value, which are used to identify the differentially expressed proteins in the analysis.

    • The result page provides a bar chart representing the upregulated and downregulated proteins in every sample.

    • Users can download the full result, which includes the normalization and differential expression calculations, and, separately, a file containing only the differentially expressed proteins (in CSV format).
    • A volcano plot is provided for each sample-control comparison in the analysis. The significance cut-offs used for the volcano plot are those given by the user on the previous page.
    • Upregulated and downregulated proteins are shown in red and green, respectively. Insignificant proteins are shown in black.

    • The heatmap represents the fold change of statistically significant proteins that are differentially expressed in one or more samples.
    • ProteoArk provides two methods for clustering: hierarchical clustering and k-means. For k-means, the number of clusters in the dataset (k) must be determined in advance, which can be done using the elbow method provided on the same page. The elbow method computes the WCSS (within-cluster sum of squares), i.e., the sum of the squared distances between the points in a cluster and the cluster centroid.
    • The value of k is the point where the elbow forms (the elbow point). K-means is not available if the number of differentially expressed proteins is less than 8 (8 is the total number of DE proteins across all datasets).

    • For hierarchical clustering, users do not need to give the number of clusters.
    • The color bar of the heatmap is customized so that upregulated proteins are shown in red and downregulated proteins in green. If a protein is not differentially expressed in a sample, it is shown in black.

    Functional enrichment analysis, KEGG pathway visualization and STRING analysis

    Step 10: Users can perform functional enrichment analysis on the differentially expressed proteins in ProteoArk and select the p-value threshold for it. To obtain results without any filter, provide 1 as the p-value cut-off. Gene ontology results are represented using a circular bar plot. Each bar in the plot is labelled with its GO number. Users can also download the entire result.


    • Users can view all the KEGG pathways of the differentially expressed genes from the provided list. The genes are highlighted in color in the corresponding pathway map.
    • Users can download pathway images in PNG format.

    • Using the STRING API, ProteoArk displays the protein-protein interactions for the differentially expressed genes.

Differential expression analysis


    The differential expression analysis section accepts either normalized proteomic data or proteomic data which does not need any normalization for analysis

    Upload file

    Step 1: Provide the details on the number of normalized test samples and normalized controls (Exclude the number of replicates)

    Step 2: Upload the input file in any of the given formats (.csv, .tsv or .xlsx) and click submit


    Step 3: Drag and drop each of the replicates in the corresponding containers

    • Replicates of the samples should be dragged and dropped into the corresponding test sample and control sample containers.
    • Click the Next button after grouping the samples.

  • Differential expression analysis

    Step 7: Users can rename the samples and click submit (renaming is not mandatory, but the names provided in this section will be used as sample names in further analysis).

    • Select the protein accession or gene symbol column from the input file.

    Step 8: For differential expression analysis, the p-value and fold change can be calculated in either of two ways:

    • Using limma (an R package), which ProteoArk runs through Docker.

    Or

    • By choosing separate methods for the p-value (Welch’s t-test, two-sample t-test, one-way ANOVA, or two-way ANOVA) and for the fold-change calculation.
    • The user can set the significance cut-offs for fold change and p-value, which are used to identify the differentially expressed proteins in the analysis.
    • The result page provides a bar chart representing the upregulated and downregulated proteins in every sample.

    • Users can download the full result, which includes the normalization and differential expression calculations, and, separately, a file containing only the differentially expressed proteins (in CSV format).
    • A volcano plot is provided for each sample-control comparison in the analysis. The significance cut-offs used for the volcano plot are those given by the user on the previous page.
    • Upregulated and downregulated proteins are shown in red and green, respectively. Insignificant proteins are shown in black.

    • The heatmap represents the fold change of statistically significant proteins that are differentially expressed in one or more samples.
    • ProteoArk provides two methods for clustering: hierarchical clustering and k-means. For k-means, the number of clusters in the dataset (k) must be determined in advance, which can be done using the elbow method provided on the same page. The elbow method computes the WCSS (within-cluster sum of squares), i.e., the sum of the squared distances between the points in a cluster and the cluster centroid.
    • The value of k is the point where the elbow forms (the elbow point). K-means is not available if the number of differentially expressed proteins is less than 8 (8 is the total number of DE proteins across all datasets).

    • For hierarchical clustering, users do not need to give the number of clusters.
    • The color bar of the heatmap is customized so that upregulated proteins are shown in red and downregulated proteins in green. If a protein is not differentially expressed in a sample, it is shown in black.
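Of the p-value methods listed in Step 8, one-way ANOVA is the one that applies when more than two groups are compared at once. The sketch below computes its F statistic from the standard between-group and within-group sums of squares; it is a textbook illustration with invented data, not ProteoArk's code, and converting F to a p-value (via the F distribution) is left to a statistics package.

```python
import statistics


def one_way_anova_f(groups):
    """F statistic and (between, within) degrees of freedom for one-way ANOVA."""
    all_values = [v for group in groups for v in group]
    grand_mean = statistics.mean(all_values)
    ss_between = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical abundances of one protein in a control and two test groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
```

A large F means the group means differ by more than the replicate-to-replicate scatter would explain, which is the evidence the p-value summarizes.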

  • Functional enrichment analysis, KEGG pathway visualization and STRING analysis

    Step 9: Users can perform functional enrichment analysis on the differentially expressed proteins in ProteoArk and select the p-value threshold for it. To obtain results without any filter, provide 1 as the p-value cut-off. Gene ontology results are represented using a circular bar plot. Each bar in the plot is labelled with its GO number. Users can also download the entire result.


    • Users can view all the KEGG pathways of the differentially expressed genes from the provided list. The genes are highlighted in color in the corresponding pathway map.
    • Users can download pathway images in PNG format.

    • Using the STRING API, ProteoArk displays the protein-protein interactions for the differentially expressed genes.

Multiple control


    ProteoArk provides options for users to compare samples with multiple controls.

    Step 1:

    • Select the input data type (Technical or biological replicate)
    • Provide the number of control samples and test samples
    • Choose the file for analysis and submit

    Step 2: Drag and drop each of the replicates into the corresponding containers

    • Click the Next button after grouping the samples

    In case of biological replicates


    Here all the samples in the test sample are compared with 126 control and 127 sample.

    In case of technical replicates


    • After normalization of the data, the user can choose the control for further analysis (ProteoArk uses only one control for comparison at a time).

    • Select the control and submit for further analysis.
    • The following page shows the results of the differential expression analysis of the samples against the chosen control.
    • After the analysis, the user can reset the control and perform a second differential expression analysis with the second control. The reset option is provided on the result page itself.


    Data visualization


    This section includes 15 different kinds of charts for visualization. Each plot needs the input file in a specific format. Users can refer to the example file when formatting the input data.

    Step 1: Choose the plot for your data


    Step 2: Upload the input file in CSV format and submit


    Step 3: Choose the appropriate fields for plotting

    • Some fields accept only numerical values and others only character values. Selecting the wrong fields will result in an error message or an incorrect graph.
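Before uploading, it can help to check which columns of a CSV file are numerical and which are character fields. The sketch below does this with the Python standard library; the column names and values are invented, and the rule used (a column is numerical when every non-empty cell parses as a number) is an assumption rather than ProteoArk's own validation logic.

```python
import csv
import io


def is_number(text):
    """True when `text` parses as a float."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def column_kinds(csv_text):
    """Label each column 'numerical' or 'character' from its non-empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    kinds = {}
    for name in reader.fieldnames or []:
        cells = [row[name] for row in rows if row[name]]
        numeric = bool(cells) and all(is_number(c) for c in cells)
        kinds[name] = "numerical" if numeric else "character"
    return kinds


# Hypothetical input for a plot that needs one label and two numeric fields.
example = "gene,log2fc,pvalue\nTP53,1.8,0.003\nMDM2,-1.2,0.04\n"
kinds = column_kinds(example)
```

Running a check like this on the example file and on your own file makes mismatched field types visible before a plot is requested.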

    ProteoArk’s ‘Data visualization’ section accepts not only proteomic data but any kind of data that is formatted according to the instructions for plotting.