Welcome to the Galaxy Machine Learning workbench

The Galaxy Machine Learning workbench is a comprehensive set of data preprocessing, machine learning, deep learning and visualisation tools, together with consolidated workflows for end-to-end machine learning analysis and training materials that showcase how to use these tools. The workbench runs on the Galaxy framework, which provides simple access, easy extension, flexible adaptation to personal and security needs, and sophisticated machine learning analyses that do not require command-line knowledge.

The workbench provides you with a Swiss Army knife of scikit-learn, Keras (a deep learning library based on TensorFlow) and various other tools to transform, learn from, predict on and plot your data.

The workbench is currently developed by the Goecks Lab and the European Galaxy project. The German Network for Bioinformatics Infrastructure (de.NBI), which runs the German ELIXIR Node, provides the necessary compute clusters with CPU and GPU resources.

The project is a community effort. Please jump in, ask questions, and contribute to the development of new tools, workflows or training materials!

Content

  1. Get started
  2. Training
  3. Available tools
    1. Classification
    2. Regression
    3. Clustering
    4. Model building
    5. Model evaluation
    6. Preprocessing and feature selection
    7. Deep learning
    8. Visualization
    9. Utilities
    10. Interactive Environments
  4. Contributors

Get started

Are you new to Galaxy, or returning after a long time, and looking for help to get started? Take a guided tour through Galaxy’s user interface.

Training

We are passionate about training, so we work in close collaboration with the Galaxy Training Network (GTN) to develop training materials for data analyses based on Galaxy (Batut et al., 2017). These materials, hosted in the GTN GitHub repository, are available online at https://training.galaxyproject.org.

Want to learn more about machine learning? Take one of our guided tours or check out the following hands-on tutorials, developed together with the GTN community.

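The following lessons are offered. Depending on the lesson, the materials include slides, a hands-on tutorial, an input dataset, workflows, a Galaxy tour and a Galaxy history.

  1. Basics of machine learning
  2. Classification
  3. Regression
  4. Age prediction using machine learning
  5. Clustering
  6. Introduction to deep learning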

Available tools

This section lists the most important tools that have been integrated into the Machine Learning workbench. Many more tools are available, so please take a closer look at the tool panel. For better readability, the tools are divided into categories.

Classification

Identifying which category an object belongs to.

Tool | Description | Reference
sklearn_svm_classifier | Support vector machines (SVMs) for classification | Pedregosa et al. 2011
sklearn_nn_classifier | Nearest Neighbors Classification | Pedregosa et al. 2011
sklearn_ensemble | Ensemble methods for classification and regression | Pedregosa et al. 2011
sklearn_discriminant_classifier | Linear and Quadratic Discriminant Analysis | Pedregosa et al. 2011
sklearn_generalized_linear | Generalized linear models for classification and regression | Pedregosa et al. 2011
sklearn_clf_metrics | Calculate metrics for classification performance | Pedregosa et al. 2011
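
The tools in this category are based on scikit-learn (Pedregosa et al. 2011). As a rough illustration of what a classification step looks like outside Galaxy, the Python sketch below fits a support vector classifier and computes accuracy and a confusion matrix on held-out data. It is not the Galaxy tools' implementation: the iris dataset, the 70/30 split, the RBF kernel and the function name classify_iris are all assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def classify_iris(random_state=0):
    """Fit an SVM classifier; return (accuracy, confusion matrix) on held-out data."""
    # Example dataset bundled with scikit-learn (an assumption for this sketch).
    X, y = load_iris(return_X_y=True)
    # Hold out 30% of the samples, keeping class proportions equal in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state
    )
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return accuracy_score(y_test, y_pred), confusion_matrix(y_test, y_pred)


if __name__ == "__main__":
    accuracy, matrix = classify_iris()
    print(f"accuracy: {accuracy:.3f}")
    print(matrix)
```

Rows of the confusion matrix correspond to true classes and columns to predicted classes, so off-diagonal entries count misclassified samples.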

Regression

Predicting a continuous-valued attribute associated with an object.

Tool | Description | Reference
sklearn_ensemble | Ensemble methods for classification and regression | Pedregosa et al. 2011
sklearn_generalized_linear | Generalized linear models for classification and regression | Pedregosa et al. 2011
sklearn_regression_metrics | Calculate metrics for regression performance | Pedregosa et al. 2011
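
To make the idea of predicting a continuous value and scoring the result concrete, here is a minimal Python sketch using scikit-learn directly. It assumes synthetic data from make_regression, a ridge regression model as one example of a linear model, and a hypothetical helper name fit_and_score_regression. None of these choices come from the Galaxy tools themselves.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_and_score_regression(random_state=0):
    """Fit a ridge regression on synthetic data; return R^2 and MSE on held-out data."""
    # Synthetic linear data with Gaussian noise (an assumption for this sketch).
    X, y = make_regression(
        n_samples=200, n_features=5, noise=10.0, random_state=random_state
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )
    model = Ridge(alpha=1.0)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "r2": r2_score(y_test, y_pred),
        "mse": mean_squared_error(y_test, y_pred),
    }


if __name__ == "__main__":
    for name, value in fit_and_score_regression().items():
        print(f"{name}: {value:.3f}")
```

An R^2 close to 1 means the model explains most of the variance in the held-out targets, while the mean squared error is expressed in squared units of the target.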

Clustering

Automatic grouping of similar objects into sets.

Tool | Description | Reference
sklearn_numeric_clustering | Different numerical clustering algorithms | Pedregosa et al. 2011
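
As one example of a numerical clustering algorithm, the Python sketch below runs k-means from scikit-learn on synthetic, well-separated point clouds and compares the recovered groups with the known ones. The cluster centers, the choice of k-means and the helper name cluster_blobs are assumptions for illustration. The Galaxy tool offers several algorithms and is not limited to this one.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def cluster_blobs(random_state=0):
    """Cluster three synthetic blobs with k-means; return (labels, adjusted Rand index)."""
    # Three well-separated centers, chosen by hand for this sketch.
    centers = [(-5, -5), (0, 5), (5, -5)]
    X, true_labels = make_blobs(
        n_samples=150, centers=centers, cluster_std=0.8, random_state=random_state
    )
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=random_state)
    labels = kmeans.fit_predict(X)
    # The adjusted Rand index ignores how cluster ids are numbered.
    return labels, adjusted_rand_score(true_labels, labels)


if __name__ == "__main__":
    labels, ari = cluster_blobs()
    print(f"clusters found: {len(set(labels))}, adjusted Rand index: {ari:.3f}")
```

Clustering is unsupervised, so the true labels are used here only to check the result, never to fit the model.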

Model building

Building general machine learning models.

Tool | Description | Reference
sklearn_estimator_attributes | Estimator attributes to get all attributes from an estimator or scikit object | Pedregosa et al. 2011
sklearn_stacking_ensemble_models | Stacking Ensembles to build stacking, voting ensemble models with numerous base options | Pedregosa et al. 2011
sklearn_searchcv | Hyperparameter Search performs hyperparameter optimization using various SearchCVs | Pedregosa et al. 2011
sklearn_build_pipeline | Pipeline Builder as an all-in-one platform to build pipeline, single estimator, preprocessor and custom wrappers | Pedregosa et al. 2011
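
Pipeline building and hyperparameter search can be sketched in plain scikit-learn as follows. This is a minimal example under stated assumptions, not the code behind the Galaxy tools: the breast cancer dataset, the two pipeline steps, the grid of C values and the helper name search_pipeline are all chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def search_pipeline():
    """Grid-search the regularization strength of a scaled logistic regression."""
    X, y = load_breast_cancer(return_X_y=True)
    # A pipeline chains preprocessing and the final estimator into one model.
    pipeline = Pipeline(
        [
            ("scale", StandardScaler()),
            ("clf", LogisticRegression(max_iter=1000)),
        ]
    )
    # "<step name>__<parameter>" addresses a parameter of a pipeline step.
    param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
    search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    return search.best_params_, search.best_score_


if __name__ == "__main__":
    best_params, best_score = search_pipeline()
    print(best_params, f"mean CV accuracy: {best_score:.3f}")
```

Putting the scaler inside the pipeline means it is refitted on each training fold during the search, which avoids leaking information from the validation fold.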

Model evaluation

Evaluating, validating and choosing parameters and models.

Tool | Description | Reference
sklearn_model_validation | Model Validation includes cross_validate, cross_val_predict, learning_curve, and more | Pedregosa et al. 2011
sklearn_pairwise_metrics | Evaluate pairwise distances or compute affinity or kernel for sets of samples | Pedregosa et al. 2011
sklearn_train_test_eval | Train, Test and Evaluation to fit a model using part of dataset and evaluate using the rest | Pedregosa et al. 2011
model_prediction | Model Prediction predicts on new data using a prefitted model | Chollet et al. 2011
sklearn_fitted_model_eval | Evaluate a Fitted Model using a new batch of labeled data | Pedregosa et al. 2011
sklearn_model_fit | Fit a Pipeline, Ensemble or other models using a labeled dataset | Pedregosa et al. 2011
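
The scikit-learn functions cross_validate and cross_val_predict named in the table can be tried directly in Python. The sketch below is only an illustration under assumptions: the wine dataset, the k-nearest-neighbors model, the two scoring metrics and the helper name evaluate_model are not taken from the Galaxy tool.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_predict, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_model(n_folds=5):
    """Return per-fold scores and out-of-fold predictions for a simple classifier."""
    X, y = load_wine(return_X_y=True)
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    # cross_validate reports one score per fold for every requested metric.
    scores = cross_validate(model, X, y, cv=n_folds, scoring=["accuracy", "f1_macro"])
    # cross_val_predict returns, for each sample, the prediction made by the
    # model that did not see that sample during training.
    predictions = cross_val_predict(model, X, y, cv=n_folds)
    return scores, predictions


if __name__ == "__main__":
    scores, predictions = evaluate_model()
    print("accuracy per fold:", scores["test_accuracy"].round(3))
    print("macro F1 per fold:", scores["test_f1_macro"].round(3))
    print("out-of-fold predictions:", predictions.shape)
```

Per-fold scores show how stable the model is across different splits, while out-of-fold predictions are useful for plotting or for building a confusion matrix over the whole dataset.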

Preprocessing and feature selection

Feature selection and preprocessing.

Tool | Description | Reference
sklearn_data_preprocess | Preprocess raw feature vectors into standardized datasets | Pedregosa et al. 2011
sklearn_feature_selection | Feature Selection module, including univariate filter selection methods and recursive feature elimination algorithm | Pedregosa et al. 2011
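
Standardization and univariate feature selection can be illustrated with a few lines of scikit-learn. This is a hedged sketch, not the Galaxy tools' code: the iris dataset, the ANOVA F-test as the scoring function, k=2 and the helper name scale_and_select are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler


def scale_and_select(k=2):
    """Standardize features, then keep the k features with the highest F-score."""
    X, y = load_iris(return_X_y=True)
    # Rescale every feature to zero mean and unit variance.
    X_scaled = StandardScaler().fit_transform(X)
    # Univariate filter: score each feature independently against the labels.
    selector = SelectKBest(score_func=f_classif, k=k)
    X_selected = selector.fit_transform(X_scaled, y)
    kept_columns = selector.get_support(indices=True)
    return X_scaled, X_selected, kept_columns


if __name__ == "__main__":
    X_scaled, X_selected, kept_columns = scale_and_select()
    print("shape before:", X_scaled.shape, "after:", X_selected.shape)
    print("indices of kept features:", kept_columns)
```

In a real analysis both steps should be fitted on the training data only, for example by placing them inside a pipeline, so that the test data does not influence the selection.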

Deep learning

Build and use deep neural networks.

Tool | Description | Reference
keras_batch_models | Build Deep learning Batch Training Models with online data generator for Genomic/Protein sequences and images | Chollet et al. 2011
keras_model_builder | Create deep learning model with an optimizer, loss function and fit parameters | Chollet et al. 2011
keras_model_config | Create a deep learning model architecture using Keras | Chollet et al. 2011
keras_train_and_eval | Deep learning training and evaluation either implicitly or explicitly | Chollet et al. 2011

Visualization

Plotting and visualization.

Tool | Description | Reference
plotly_regression_performance_plots | Plot actual vs predicted curves and residual plots of tabular data |
plotly_ml_performance_plots | Plot confusion matrix, precision, recall and ROC and AUC curves of tabular data |
ml_visualization_ex | Machine Learning Visualization Extension includes several types of plotting for machine learning | Chollet et al. 2011

Utilities

General data and table manipulation tools.

Tool | Description | Reference
table_compute | The power of the pandas data library for manipulating and computing expressions upon tabular data and matrices. |
datamash_ops | Datamash operations on tabular data |
datamash_transpose | Transpose rows/columns in a tabular file |
sklearn_sample_generator | Generate random samples with controlled size and complexity | Pedregosa et al. 2011
sklearn_train_test_split | Split Dataset into training and test subsets | Pedregosa et al. 2011
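
Generating a synthetic dataset and splitting it into training and test subsets looks roughly like this in scikit-learn. The sample count, the number of features, the 75/25 split and the helper name generate_and_split are assumptions for this sketch and do not describe the defaults of the Galaxy tools.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def generate_and_split(n_samples=100, test_size=0.25, random_state=0):
    """Generate a synthetic two-class dataset and split it into train and test parts."""
    # Size and complexity are controlled through the generator's arguments.
    X, y = make_classification(
        n_samples=n_samples,
        n_features=10,
        n_informative=4,
        n_classes=2,
        random_state=random_state,
    )
    # stratify=y keeps the class balance the same in both subsets.
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )


if __name__ == "__main__":
    X_train, X_test, y_train, y_test = generate_and_split()
    print("train:", X_train.shape, "test:", X_test.shape)
```

Fixing random_state makes both the generated data and the split reproducible across runs.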

Interactive Environments

You have done the heavy lifting and now want to use your coding skills inside Jupyter or RStudio? Work on data with the following:

Tool | Description | Reference
Jupyter | JupyterLab |
RStudio | RStudio |

Contributors

Our Data Policy

Registered Users: User data on UseGalaxy.eu (i.e. datasets, histories) will be available as long as they are not deleted by the user. Once marked as deleted, the datasets will be permanently removed within 14 days. If the user "purges" a dataset in Galaxy, it will be removed immediately and permanently. An extended quota can be requested for a limited time period in special cases.

Unregistered Users: Processed data will only be accessible during one browser session, using a cookie to identify your data. This cookie is not used for any other purposes (e.g. tracking or analytics). If the UseGalaxy.eu service is not accessed for 90 days, those datasets will be permanently deleted.

FTP Data: Any user data uploaded to our FTP server should be imported into Galaxy as soon as possible. Data left in FTP folders for more than 3 months will be deleted.

GDPR Compliance: The Galaxy service complies with the EU General Data Protection Regulation (GDPR). You can read more about this on our Terms and Conditions.