Tutorial

Bifrost

Bifrost is a tool made by the core development team that lets users share variables between the Python runtime and the Node.js runtime. Bifrost also allows you to run JavaScript code in the Node.js runtime.

Why Use Bifrost? What Does it Do?

DCP is built with JavaScript, and tasks must be written in JavaScript or compiled to WebAssembly. However, the core development team recognizes that many people use Python to pipeline their data or to perform pre- and post-processing.

With this in mind, Bifrost lets developers use primitive Python variables (as well as NumPy arrays) with DCP. You must still write your programs in JavaScript if you want to distribute them, but you can use your shared Python variables without any modification. Similarly, the variables you receive back from a program completed by DCP will be JavaScript values. Bifrost converts these back into Python, allowing you to visualize them, post-process them, or return them to a database.

So, even though you cannot (yet!) run Python code on DCP, you can quickly and easily shift the variables to JavaScript when there is a compute-heavy, parallelizable step.

How Do I Use Bifrost?

To use Bifrost, install it and import it into your project with the following code:

%%capture

# Downloads and installs a customized NodeJS backend for this notebook
# This may take a few moments when first executed in a new runtime

!npm install -g n && n 10.20.1
!pip install git+https://github.com/Kings-Distributed-Systems/Bifrost

from bifrost import npm, node

Once this is done, all shareable Python variables can be used in the Node.js runtime.

You can use this either in a computational notebook, by tagging a cell with the cell magic %%node, or in any Python environment, by using the following template:

resulting_dictionary_of_variables = node.run(js_code_as_string, dictionary_mapping_of_variable_names_to_values)
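
For instance, here is a minimal sketch of both styles. The variable names are illustrative, and it assumes that node.run returns the JavaScript-side variables as a dictionary, as the template above describes:

# A variable defined in Python in a notebook...
seconds_per_day = 86400

# ...can be used directly in a cell tagged with the %%node magic:
#   %%node
#   console.log(seconds_per_day * 7);  // JavaScript running in Node.js

# From any Python environment, pass variables to node.run explicitly:
from bifrost import node
result = node.run('let week = seconds_per_day * 7;', {'seconds_per_day': 86400})
print(result['week'])  # expected: 604800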

Where Bifrost Can Be Used

At present, you can launch programs that use DCP from three locations:

  1. A local Node.js instance
  2. Vanilla Web
  3. Any Bifrost-enabled runtime (currently only Python)

Examples of Bifrost

The best way to learn what Bifrost does is to see it in action!

Below is an example ML project that uses Python variables and converts them into JavaScript with Bifrost. It's been taken directly from a Google Colab example that also performs other DCP operations; we're working on more step-by-step documentation for Bifrost itself.

Using DCP for TensorFlow

This notebook presents a small machine learning project using TensorFlow, the MNIST dataset, and the Distributed Compute Protocol (DCP). In it, an external server cluster is being accessed with just a few commands.

Setting up your Node.js environment

The first step in using DCP is to install the appropriate version of Node.js, along with the required repositories and libraries. Computational tasks can be deployed to DCP from either Node.js or the browser environment.

The interface between Python and Node.js inside the notebook is a customized tool called Bifrost. This tool allows developers to use primitive Python variables and NumPy arrays in their JavaScript DCP applications.

%%capture

# Downloads and installs a customized NodeJS backend for this notebook
# This may take a few moments when first executed in a new runtime

!npm install -g n && n 10.20.1
!pip install git+https://github.com/Kings-Distributed-Systems/Bifrost

from bifrost import npm, node

Configuring DCP-Client

With the Bifrost interface in place, we can use "npm.install" to install Node.js packages into our Colab environment.

In our case, we want to install DCP-Client with the following single line of code:

%%capture

# Installs DCP-Client using the notebook's Node.js backend

npm.install('dcp-client')

The next step is to connect this Colab notebook with a remote server cluster, so that work can execute in parallel across many nodes. Access to this cluster is granted through the client we just downloaded.

Using this client requires an API key called a 'keystore'. This should have been provided to you alongside the URL to this Colab. If you do not have a keystore, please email us at info@kingsds.network.

# Loads the ID used to deploy and pay for jobs on DCP

# When prompted, please upload the keystore file that you were provided with

from google.colab import files

KEYSTORE_NAME = list(files.upload().keys())[0]
!mkdir -p ~/.dcp && cp /content/$KEYSTORE_NAME ~/.dcp/id.keystore && cp ~/.dcp/id.keystore ~/.dcp/default.keystore


Saving colab.keystore to colab.keystore

%%node

// # Points DCP to your uploaded ID, and initializes the client

require('dcp-client').initSync();

const compute = require('dcp/compute');
const dcpCli = require('dcp/dcp-cli');

Hyperparameter tuning MNIST models over DCP

The next step is to find an optimal set of hyperparameters for the MNIST character-recognition model that we will be training. This is our first compute 'workload'.

This workload is divided into discrete units of compute called slices, each representing a neural network model to be trained with a different set of hyperparameters. These are processed in parallel through DCP, on servers connected to this particular cluster. As slices are computed, the results are returned to this Colab notebook.

# Declare the functions that we will use to create a population of random hyperparameter sets

import math, random

def create_generation(population_size, possible_parameters):

    def random_parameter_from_key(my_key):
        random_index = math.floor(random.random() * len(possible_parameters[my_key]))
        return possible_parameters[my_key][random_index]

    new_population = []

    for x in range(population_size):
        new_member = {}
        for key in possible_parameters:
            new_member[key] = random_parameter_from_key(key)
        new_population.append(new_member)

    return new_population
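
For illustration, calling create_generation with a small population and a two-key parameter space returns a list of randomly drawn hyperparameter dictionaries (the values in the comment below are just one possible outcome):

example_population = create_generation(2, {'lr': [0.1, 0.01], 'num_units': [8, 16]})
# e.g. [{'lr': 0.01, 'num_units': 16}, {'lr': 0.1, 'num_units': 8}]
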
%%node

// # Define the hyperparameter tuning function that will be sent out to our workers across the DCP network

tuningFunction = `async function(modelParameters) {

  md = require('mnist');
  tf = require('tfjs');
  tf.setBackend('cpu');

  progress(0);

  // # Construct model based on set of hyperparameters provided to this worker

  let myModel = tf.sequential();
  myModel.add(tf.layers.flatten({inputShape: [28, 28, 1]}));
  for (let i = 0; i < modelParameters.num_layers; i++){
      myModel.add(tf.layers.dense({units: modelParameters.num_units, activation: modelParameters.activation}));
  }
  myModel.add(tf.layers.dense({units: 10, activation: 'softmax'}));
  myModel.compile({optimizer: tf.train[modelParameters.optimizer.toLowerCase()](modelParameters.lr), loss: 'categoricalCrossentropy', metrics: ['accuracy']});

  progress();

  // # Process MNIST character recognition data for training

  let myData = await md.load();

  //let myImages = new Float32Array(myData.images);
  //myImages = await myImages.map(x => x / 255.0);

  progress();

  let labelsTensor = await tf.tensor2d(myData.labels, [myData.labels.length / 10, 10]);
  let imagesTensor = await tf.tensor4d(myData.images, [myData.images.length / 784, 28, 28, 1]);

  progress();

  // # Train our model on MNIST data set, tracking loss and accuracy

  let myLoss;
  let myAccuracy;

  await myModel.fit(imagesTensor, labelsTensor, {
    batchSize: 100,
    epochs: 3,
    validationSplit: 0.15,
    callbacks: {onBatchEnd: async (batch, logs) => {
      progress();
    }, onEpochEnd: async (epoch, logs) => {
      myLoss = logs.val_loss;
      myAccuracy = logs.val_acc;
    }}
  });
  tf.dispose(myModel);

  progress(1.0);

  // # Return model hyperparameters, along with final loss and accuracy after training and validation

  return { parameters: modelParameters, loss: myLoss, accuracy: myAccuracy };
}`;

// # Declare the function that will deploy our hyperparameter tuning job to the DCP network

async function postJob(parameterSet, myMaxRuntime) {

    let myKeystore = await dcpCli.getAccountKeystore();

    const job = compute.for(parameterSet, tuningFunction);

    let myTimer = setTimeout(function(){
        job.cancel();
        console.log('Job reached ' + myMaxRuntime + ' minutes.');
    }, myMaxRuntime * 60 * 1000);

    job.public.name = 'DCP Colab Notebook - Hyperparameter Tuning';
    job.requires(['aistensorflow/tfjs', 'aitf-mnist-data/mnist']);

    job.on('accepted', () => {
        console.log('Job accepted: ' + job.id);
    });
    job.on('status', (status) => {
        console.log('STATUS:');
        console.log(
            status.total + ' slices posted, ' +
            status.distributed + ' slices distributed, ' +
            status.computed + ' slices computed.'
        );
    });
    job.on('result', (thisOutput) => {
        console.log('RESULT:');
        console.log(thisOutput.result);
        if (thisOutput.result.accuracy > tuning_best_result.accuracy) tuning_best_result = thisOutput.result;
    });

    try {
        await job.exec(compute.marketValue, myKeystore);
    } catch (myError) {
        console.log('Job halted.');
    }

    clearTimeout(myTimer);

    return(tuning_best_result);
}

Here we are setting up a population of different random sets of hyperparameters to be tested. You can make the number of sets in the population higher or lower by adjusting the variable population_size; there are thousands of possible combinations in the hyperparameter space that we've defined. The number you enter will be the number of sets that get packaged into slices, to be computed by nodes on the DCP network.

Additionally, we've set a maximum runtime for the hyperparameter tuning job with the variable tuning_max_runtime. Some slices will take longer than others, so here we can decide the longest we're willing to wait to get back all of our results. Once this many minutes have elapsed, the job will be stopped if computation is still ongoing; we will still have all of our results from the slices that were successfully completed before that point, and can then proceed with the best-performing set of hyperparameters that we were given.

#@markdown ### Tuning Job Parameters

#@markdown Number of random hyperparameter sets for training:
population_size = 50 #@param {type:"slider", min:10, max:100, step:10}

#@markdown Maximum runtime allowed for the job, in minutes:
tuning_max_runtime = 15 #@param {type:"slider", min:5, max:30, step:5}

# Define the range of possible hyperparameters that we will be using for our models

parameter_space = {
    "activation": ['linear','relu','selu','sigmoid','softmax', 'tanh'],
    "optimizer": ['SGD','Adagrad','Adadelta','Adam','Adamax','RMSprop'],
    "num_layers": [1, 2, 3, 4, 5, 6],
    "num_units": [1, 2, 4, 8, 16, 32],
    "lr": [1, 0.1, 0.01, 0.001, 0.0001, 0.00001],
}

# Ongoing tracker of best performing set
tuning_best_result = { "accuracy": 0 }

# Generate a set of model hyperparameters
tuning_parameters = create_generation(population_size, parameter_space)

%%node

// # Call functions to generate a set of model hyperparameters, and deploy them to DCP for training in parallel

deployTime = Date.now();

postJob(tuning_parameters, tuning_max_runtime).then((value) => {
    console.log('Job complete.');

    console.log('Best accuracy found:');
    console.log(tuning_best_result.accuracy);

    let finalTime = Date.now() - deployTime;

    console.log('Total time to compute:')
    console.log((finalTime / 1000).toFixed(2) + ' seconds.')
});
Job accepted: golx5xRGNlTolpxu6eAHOk
STATUS:
50 slices posted, 0 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 1 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 2 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 3 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 4 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 5 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 6 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 7 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 8 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 9 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 10 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 11 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 12 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 13 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 14 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 15 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 16 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 17 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 18 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 19 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 20 slices distributed, 0 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'RMSprop',
num_layers: 5,
num_units: 8,
lr: 0.001 },
loss: 0.3785330653190613,
accuracy: 0.8913846015930176 }
STATUS:
50 slices posted, 20 slices distributed, 1 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adam',
num_layers: 3,
num_units: 1,
lr: 1 },
loss: 2.4013476371765137,
accuracy: 0.10123077034950256 }
STATUS:
50 slices posted, 20 slices distributed, 2 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'SGD',
num_layers: 5,
num_units: 8,
lr: 0.0001 },
loss: 2.3004343509674072,
accuracy: 0.07251282036304474 }
STATUS:
50 slices posted, 20 slices distributed, 3 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'Adagrad',
num_layers: 4,
num_units: 8,
lr: 0.00001 },
loss: 2.3340208530426025,
accuracy: 0.0923076942563057 }
STATUS:
50 slices posted, 20 slices distributed, 4 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'SGD',
num_layers: 1,
num_units: 1,
lr: 0.001 },
loss: 2.2606990337371826,
accuracy: 0.19200000166893005 }
STATUS:
50 slices posted, 20 slices distributed, 5 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'SGD',
num_layers: 5,
num_units: 8,
lr: 0.0001 },
loss: 2.3452725410461426,
accuracy: 0.09856410324573517 }
STATUS:
50 slices posted, 20 slices distributed, 6 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adadelta',
num_layers: 1,
num_units: 16,
lr: 1 },
loss: 0.7750394344329834,
accuracy: 0.7326154112815857 }
STATUS:
50 slices posted, 20 slices distributed, 7 slices computed.
STATUS:
50 slices posted, 21 slices distributed, 7 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'Adagrad',
num_layers: 5,
num_units: 1,
lr: 0.00001 },
loss: 2.3025763034820557,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 21 slices distributed, 8 slices computed.
STATUS:
50 slices posted, 22 slices distributed, 8 slices computed.
STATUS:
50 slices posted, 23 slices distributed, 8 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adamax',
num_layers: 6,
num_units: 4,
lr: 0.00001 },
loss: 2.3019137382507324,
accuracy: 0.13558974862098694 }
STATUS:
50 slices posted, 23 slices distributed, 9 slices computed.
STATUS:
50 slices posted, 24 slices distributed, 9 slices computed.
STATUS:
50 slices posted, 25 slices distributed, 9 slices computed.
STATUS:
50 slices posted, 26 slices distributed, 9 slices computed.
RESULT:
{ parameters:
{ activation: 'softmax',
optimizer: 'SGD',
num_layers: 2,
num_units: 4,
lr: 0.001 },
loss: 2.303086042404175,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 26 slices distributed, 10 slices computed.
STATUS:
50 slices posted, 27 slices distributed, 10 slices computed.
STATUS:
50 slices posted, 28 slices distributed, 10 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'RMSprop',
num_layers: 1,
num_units: 1,
lr: 0.001 },
loss: 1.7201460599899292,
accuracy: 0.3432820439338684 }
STATUS:
50 slices posted, 28 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 29 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 30 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 31 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 32 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 33 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 34 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 35 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 36 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 37 slices distributed, 11 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'SGD',
num_layers: 4,
num_units: 2,
lr: 0.0001 },
loss: 2.3416709899902344,
accuracy: 0.10338461399078369 }
STATUS:
50 slices posted, 37 slices distributed, 12 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'RMSprop',
num_layers: 1,
num_units: 1,
lr: 0.001 },
loss: 1.941662311553955,
accuracy: 0.21015384793281555 }
STATUS:
50 slices posted, 37 slices distributed, 13 slices computed.
STATUS:
50 slices posted, 38 slices distributed, 13 slices computed.
STATUS:
50 slices posted, 39 slices distributed, 13 slices computed.
STATUS:
50 slices posted, 40 slices distributed, 13 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'RMSprop',
num_layers: 6,
num_units: 16,
lr: 0.01 },
loss: 0.22380386292934418,
accuracy: 0.9342564344406128 }
STATUS:
50 slices posted, 40 slices distributed, 14 slices computed.
STATUS:
50 slices posted, 41 slices distributed, 14 slices computed.
STATUS:
50 slices posted, 42 slices distributed, 14 slices computed.
RESULT:
{ parameters:
{ activation: 'softmax',
optimizer: 'Adamax',
num_layers: 6,
num_units: 1,
lr: 0.01 },
loss: 2.3013417720794678,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 42 slices distributed, 15 slices computed.
STATUS:
50 slices posted, 43 slices distributed, 15 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'Adamax',
num_layers: 6,
num_units: 1,
lr: 0.001 },
loss: 1.9312543869018555,
accuracy: 0.20810256898403168 }
STATUS:
50 slices posted, 43 slices distributed, 16 slices computed.
STATUS:
50 slices posted, 44 slices distributed, 16 slices computed.
STATUS:
50 slices posted, 45 slices distributed, 16 slices computed.
STATUS:
50 slices posted, 46 slices distributed, 16 slices computed.
STATUS:
50 slices posted, 47 slices distributed, 16 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'Adagrad',
num_layers: 6,
num_units: 1,
lr: 0.01 },
loss: 1.9832453727722168,
accuracy: 0.19097435474395752 }
STATUS:
50 slices posted, 47 slices distributed, 17 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adam',
num_layers: 1,
num_units: 32,
lr: 0.001 },
loss: 0.1816585212945938,
accuracy: 0.9480000138282776 }
STATUS:
50 slices posted, 47 slices distributed, 18 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'SGD',
num_layers: 4,
num_units: 8,
lr: 0.1 },
loss: 2.3019304275512695,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 47 slices distributed, 19 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'Adamax',
num_layers: 3,
num_units: 1,
lr: 0.01 },
loss: 1.8381167650222778,
accuracy: 0.23887179791927338 }
STATUS:
50 slices posted, 48 slices distributed, 19 slices computed.
STATUS:
50 slices posted, 48 slices distributed, 20 slices computed.
STATUS:
50 slices posted, 49 slices distributed, 20 slices computed.
STATUS:
50 slices posted, 50 slices distributed, 20 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'Adadelta',
num_layers: 5,
num_units: 16,
lr: 0.0001 },
loss: 2.3923656940460205,
accuracy: 0.09600000083446503 }
STATUS:
50 slices posted, 50 slices distributed, 21 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adadelta',
num_layers: 6,
num_units: 4,
lr: 0.00001 },
loss: 2.302671432495117,
accuracy: 0.10451281815767288 }
STATUS:
50 slices posted, 50 slices distributed, 22 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'RMSprop',
num_layers: 2,
num_units: 4,
lr: 1 },
loss: 11.533044815063477,
accuracy: 0.1007179468870163 }
STATUS:
50 slices posted, 50 slices distributed, 23 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'Adagrad',
num_layers: 5,
num_units: 1,
lr: 0.0001 },
loss: 2.303934097290039,
accuracy: 0.1007179468870163 }
STATUS:
50 slices posted, 50 slices distributed, 24 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'Adam',
num_layers: 1,
num_units: 1,
lr: 0.0001 },
loss: 2.2228894233703613,
accuracy: 0.1890256404876709 }
STATUS:
50 slices posted, 50 slices distributed, 25 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'RMSprop',
num_layers: 6,
num_units: 16,
lr: 0.1 },
loss: 14.297988891601562,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 50 slices distributed, 26 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adagrad',
num_layers: 4,
num_units: 32,
lr: 1 },
loss: 2.3031649589538574,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 50 slices distributed, 27 slices computed.
RESULT:
{ parameters:
{ activation: 'softmax',
optimizer: 'Adamax',
num_layers: 3,
num_units: 1,
lr: 0.0001 },
loss: 2.328644037246704,
accuracy: 0.10246153920888901 }
STATUS:
50 slices posted, 50 slices distributed, 28 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'Adamax',
num_layers: 6,
num_units: 16,
lr: 0.1 },
loss: 0.32777613401412964,
accuracy: 0.91661536693573 }
STATUS:
50 slices posted, 50 slices distributed, 29 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'SGD',
num_layers: 1,
num_units: 4,
lr: 0.1 },
loss: 0.5345848798751831,
accuracy: 0.8333333134651184 }
STATUS:
50 slices posted, 50 slices distributed, 30 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'Adamax',
num_layers: 1,
num_units: 2,
lr: 0.0001 },
loss: 1.9837646484375,
accuracy: 0.3029743731021881 }
STATUS:
50 slices posted, 50 slices distributed, 31 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'RMSprop',
num_layers: 2,
num_units: 4,
lr: 1 },
loss: 2.6522903442382812,
accuracy: 0.09600000083446503 }
STATUS:
50 slices posted, 50 slices distributed, 32 slices computed.
RESULT:
{ parameters:
{ activation: 'softmax',
optimizer: 'RMSprop',
num_layers: 4,
num_units: 4,
lr: 1 },
loss: 3.0329971313476562,
accuracy: 0.10338461399078369 }
STATUS:
50 slices posted, 50 slices distributed, 33 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'RMSprop',
num_layers: 5,
num_units: 16,
lr: 0.0001 },
loss: 0.571458637714386,
accuracy: 0.836717963218689 }
STATUS:
50 slices posted, 50 slices distributed, 34 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'Adamax',
num_layers: 6,
num_units: 4,
lr: 0.001 },
loss: 1.318507432937622,
accuracy: 0.5468717813491821 }
STATUS:
50 slices posted, 50 slices distributed, 35 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'Adadelta',
num_layers: 6,
num_units: 16,
lr: 1 },
loss: 14.297988891601562,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 50 slices distributed, 36 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'Adadelta',
num_layers: 1,
num_units: 32,
lr: 0.01 },
loss: 1.288211464881897,
accuracy: 0.7542564272880554 }
STATUS:
50 slices posted, 50 slices distributed, 37 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'SGD',
num_layers: 2,
num_units: 2,
lr: 0.01 },
loss: 1.625891089439392,
accuracy: 0.32256409525871277 }
STATUS:
50 slices posted, 50 slices distributed, 38 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adagrad',
num_layers: 5,
num_units: 32,
lr: 0.00001 },
loss: 2.3079874515533447,
accuracy: 0.07846153527498245 }
STATUS:
50 slices posted, 50 slices distributed, 39 slices computed.
RESULT:
{ parameters:
{ activation: 'softmax',
optimizer: 'Adamax',
num_layers: 6,
num_units: 4,
lr: 1 },
loss: 2.3245961666107178,
accuracy: 0.08892307430505753 }
STATUS:
50 slices posted, 50 slices distributed, 40 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'Adam',
num_layers: 4,
num_units: 16,
lr: 0.00001 },
loss: 2.000369071960449,
accuracy: 0.4795897305011749 }
STATUS:
50 slices posted, 50 slices distributed, 41 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'Adagrad',
num_layers: 5,
num_units: 8,
lr: 0.00001 },
loss: 2.3882391452789307,
accuracy: 0.08892307430505753 }
STATUS:
50 slices posted, 50 slices distributed, 42 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'Adagrad',
num_layers: 5,
num_units: 8,
lr: 0.01 },
loss: 0.4534589946269989,
accuracy: 0.8678974509239197 }
STATUS:
50 slices posted, 50 slices distributed, 43 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'RMSprop',
num_layers: 4,
num_units: 8,
lr: 0.01 },
loss: 0.3614780902862549,
accuracy: 0.9039999842643738 }
STATUS:
50 slices posted, 50 slices distributed, 44 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'Adadelta',
num_layers: 2,
num_units: 16,
lr: 0.0001 },
loss: 2.2180862426757812,
accuracy: 0.1653333306312561 }
STATUS:
50 slices posted, 50 slices distributed, 45 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'Adamax',
num_layers: 1,
num_units: 8,
lr: 0.001 },
loss: 0.9384765625,
accuracy: 0.8393846154212952 }
STATUS:
50 slices posted, 50 slices distributed, 46 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'RMSprop',
num_layers: 2,
num_units: 32,
lr: 0.001 },
loss: 0.2835576832294464,
accuracy: 0.9228717684745789 }
STATUS:
50 slices posted, 50 slices distributed, 47 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adamax',
num_layers: 5,
num_units: 8,
lr: 0.0001 },
loss: 2.0433709621429443,
accuracy: 0.2981538474559784 }
STATUS:
50 slices posted, 50 slices distributed, 48 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'Adagrad',
num_layers: 6,
num_units: 32,
lr: 0.00001 },
loss: 2.2852320671081543,
accuracy: 0.12092307955026627 }
STATUS:
50 slices posted, 50 slices distributed, 49 slices computed.
RESULT:
{ parameters:
{ activation: 'softmax',
optimizer: 'RMSprop',
num_layers: 6,
num_units: 32,
lr: 0.0001 },
loss: 2.301220417022705,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 50 slices distributed, 50 slices computed.
Job complete.
Best accuracy found:
0.9480000138282776
Total time to compute:
595.79 seconds.

Describing the Console Output

What you see are discrete units of compute, called slices, being distributed (data and methods transmitted to worker nodes) and computed. In this case, they are our hyperparameter searches. As results are returned, they are displayed in the console.

Build a TensorFlow model from the best-performing hyperparameters

Now we load the MNIST dataset and train our model locally with TensorFlow, using the most accurate hyperparameter set found in the previous step.

# Load the MNIST character recognition dataset in the notebook

import tensorflow as tf

(_x_train, _y_train),(_x_test, _y_test) = tf.keras.datasets.mnist.load_data()

# Normalize image pixel data between 0 and 1, and convert labels to one-hot format

_x_train = _x_train / 255.0
_y_train = tf.keras.utils.to_categorical(_y_train, num_classes = 10)

# Construct and train model in notebook, using best-performing hyperparameters from above

_optimizer_name = getattr(tf.keras.optimizers, tuning_best_result['parameters']['optimizer'])
_model_optimizer = _optimizer_name(learning_rate = tuning_best_result['parameters']['lr'])

_model = tf.keras.models.Sequential()

_model.add(tf.keras.layers.Flatten(input_shape = (28,28,1)))

for x in range(tuning_best_result['parameters']['num_layers']):
  _model.add(tf.keras.layers.Dense(tuning_best_result['parameters']['num_units'], activation = tuning_best_result['parameters']['activation']))

_model.add(tf.keras.layers.Dense(10, activation = 'softmax'))

_model.compile(
    optimizer = _model_optimizer,
    loss = 'categorical_crossentropy',
    metrics= ['accuracy'])

_model.fit(_x_train, _y_train, epochs = 3, validation_split = 0.15, batch_size = 100)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
Epoch 1/3
510/510 [==============================] - 1s 3ms/step - loss: 0.5003 - accuracy: 0.8644 - val_loss: 0.2389 - val_accuracy: 0.9364
Epoch 2/3
510/510 [==============================] - 1s 2ms/step - loss: 0.2431 - accuracy: 0.9306 - val_loss: 0.1895 - val_accuracy: 0.9494
Epoch 3/3
510/510 [==============================] - 1s 2ms/step - loss: 0.2000 - accuracy: 0.9440 - val_loss: 0.1650 - val_accuracy: 0.9547

<tensorflow.python.keras.callbacks.History at 0x7fa7cdaa0dd8>

Saving your model with TensorFlow.js Converter

Now we save the locally trained model in TensorFlow.js format. This is how we let our trained Python model perform inferencing in parallel on the DCP network.

%%capture

# Download and install model conversion tool

!pip install tensorflowjs==2.1.0
!git clone https://github.com/Kings-Distributed-Systems/tfjs_util.git
!cd tfjs_util && npm i && npm run postinstall

# Convert the trained model from Python to JavaScript

import tensorflowjs as tfjs
import random, string

tfjs.converters.save_keras_model(_model, './tfjs_model')

# Upload saved javascript model to the DCP network for inferencing

MODULE_NAME = 'colab-' + ''.join(random.choice(string.ascii_lowercase) for i in range(25))

!node /content/tfjs_util/bin/serializeModel.js -m ./tfjs_model/model.json -o $MODULE_NAME/model.js -p 0.0.1 -d
Module published at :  colab-heauiezjsqvoagbvvstcldwao/model.js
Done!

Our last step is to use our model for inferencing with DCP.

As before, we are tapping into an external server cluster via the DCP network. This method lets us scale up our inferencing volume without paying a significant sum or being locked into a cloud operator like AWS.

%%node

// # Define the inferencing function that will be sent out to our workers across the DCP network

inferenceFunction = `async function(myData) {

  progress(0);

  tf = require('tfjs');
  tf.setBackend('cpu');

  // # Load our saved model into the worker

  let myModel = await require('model').getModel();

  progress();

  // # Convert testing data to an array and normalize between 0 and 1

  myData = await myData.split(',');
  myData = await myData.map(x => x / 255.0);

  progress();

  // # Convert normalized testing data to a tensor for prediction

  let imagesTensor = await tf.tensor4d(myData, [myData.length / 784, 28, 28, 1]);

  progress();

  // # Run saved model on the testing data tensor

  let predictResults = await tf.tidy(() => {

    const output = myModel.predict(imagesTensor);

    const axis = 1;
    const myPredictions = Array.from(output.argMax(axis).dataSync());

    return {predictions: myPredictions};
  });

  // # Release tensorflow memory, report work completion for this slice and return the predictions to the client notebook

  tf.dispose(myModel);

  progress(1.0);

  return predictResults;
}`;

// # Declare the function that will deploy our inferencing job to the DCP network

async function inferenceJob(inferenceData, inferenceLabels, myMaxRuntime) {

    let myKeystore = await dcpCli.getAccountKeystore();

    const job = compute.for(inferenceData, inferenceFunction);

    let myTimer = setTimeout(function(){
        job.cancel();
        console.log('Job reached ' + myMaxRuntime + ' minutes.');
    }, myMaxRuntime * 60 * 1000);

    job.public.name = 'DCP Colab Notebook - Saved Models';
    job.requires([`${MODULE_NAME}/model`, 'aistensorflow/tfjs'])

    job.on('accepted', () => {
        console.log('Job accepted: ' + job.id);
    });
    job.on('status', (status) => {
        console.log('STATUS:');
        console.log(
            status.total + ' slices posted, ' +
            status.distributed + ' slices distributed, ' +
            status.computed + ' slices computed.'
        );
    });
    job.on('result', (thisOutput) => {

        let sliceIndex = thisOutput.sliceNumber;
        let myPredictions = thisOutput.result.predictions;

        let correctCount = 0;
        for (let i = 0; i < myPredictions.length; i++) {
            if (myPredictions[i] == inferenceLabels[sliceIndex][i]) correctCount++;
        }
        console.log('RESULT:');
        console.log(correctCount + ' / ' + myPredictions.length + ' ( ' + ( correctCount / myPredictions.length * 100).toFixed(2) + '% )');

    });

    try {
        await job.exec(compute.marketValue, myKeystore);
    } catch (myError) {
        console.log('Job halted.');
    }

    clearTimeout(myTimer);

    return('\nJob complete.\n');
}

#@markdown ### Inference Job Parameters

#@markdown Desired number of parallel workers; the testing data will be divided into this many batches:
slice_count = 10 #@param {type:"slider", min:10, max:100, step:10}

#@markdown Maximum runtime allowed for the job, in minutes:
inference_max_runtime = 5 #@param {type:"slider", min:5, max:30, step:5}

# Make MNIST character recognition testing data loaded earlier available to the NodeJS context
xTest = _x_test
yTest = _y_test

%%node

// # Arrange testing data in batches of the number of images to be distributed to each worker

xSize = xTest.typedArray.length / slice_count;
ySize = yTest.typedArray.length / slice_count;

testingImages = [];
testingLabels = [];

for (let i = 0; i < slice_count; i++) {
    testingImages.push(xTest.typedArray.slice(i * xSize, (i + 1) * xSize).toString());
    testingLabels.push(yTest.typedArray.slice(i * ySize, (i + 1) * ySize));
}

%%node

// # Calls functions to deploy the saved model and testing data to the DCP network for inferencing in parallel

inferenceJob(testingImages, testingLabels, inference_max_runtime).then((value) => {
    console.log(value);
});
Job accepted: el5Us9nrlrVrnw54nyN97R
STATUS:
10 slices posted, 0 slices distributed, 0 slices computed.
STATUS:
10 slices posted, 1 slices distributed, 0 slices computed.
STATUS:
10 slices posted, 2 slices distributed, 0 slices computed.
RESULT:
947 / 1000 ( 94.70% )
STATUS:
10 slices posted, 2 slices distributed, 1 slices computed.
RESULT:
921 / 1000 ( 92.10% )
STATUS:
10 slices posted, 2 slices distributed, 2 slices computed.
STATUS:
10 slices posted, 3 slices distributed, 2 slices computed.
RESULT:
940 / 1000 ( 94.00% )
STATUS:
10 slices posted, 3 slices distributed, 3 slices computed.
STATUS:
10 slices posted, 4 slices distributed, 3 slices computed.
STATUS:
10 slices posted, 5 slices distributed, 3 slices computed.
STATUS:
10 slices posted, 6 slices distributed, 3 slices computed.
STATUS:
10 slices posted, 7 slices distributed, 3 slices computed.
STATUS:
10 slices posted, 8 slices distributed, 3 slices computed.
RESULT:
963 / 1000 ( 96.30% )
STATUS:
10 slices posted, 8 slices distributed, 4 slices computed.
RESULT:
936 / 1000 ( 93.60% )
STATUS:
10 slices posted, 8 slices distributed, 5 slices computed.
RESULT:
977 / 1000 ( 97.70% )
STATUS:
10 slices posted, 8 slices distributed, 6 slices computed.
RESULT:
968 / 1000 ( 96.80% )
STATUS:
10 slices posted, 8 slices distributed, 7 slices computed.
STATUS:
10 slices posted, 9 slices distributed, 7 slices computed.
STATUS:
10 slices posted, 10 slices distributed, 7 slices computed.
RESULT:
922 / 1000 ( 92.20% )
STATUS:
10 slices posted, 10 slices distributed, 8 slices computed.
RESULT:
950 / 1000 ( 95.00% )
STATUS:
10 slices posted, 10 slices distributed, 9 slices computed.
RESULT:
976 / 1000 ( 97.60% )
STATUS:
10 slices posted, 10 slices distributed, 10 slices computed.
Job complete.

Describing the Console Output

What you see are discrete units of compute, called slices, being distributed (data and methods transmitted to worker nodes) and computed. In this case, they are our batches of inferencing examples. As results are returned, they are displayed in the console.

This is just one example of DCP. These basic steps can be applied to accelerate other AI/ML frameworks like PyTorch and Keras, as well as non-AI applications. As in this example, there is no requirement to manually provision or orchestrate compute resources.

The cluster that powers this example was drawn from distributed, cloud-based servers. At scale, the cost of these is ~80% less than the equivalent from a public cloud like AWS or Microsoft Azure. DCP also lets developers build with private, internal networks made of underutilized machines. Private networks result in cost savings of 95% or more.

To learn more and get in touch with our engineers, please email info@kingsds.network.

Future Extensions of Bifrost

At the moment, Bifrost supports JavaScript <-> Python syncing of basic Python variables and NumPy arrays. In the near future (as of March 2021), the core development team will add further functionality, which may include support for other programming languages.