Bifrost

Bifrost is a tool made by the core development team to let users share variables between the Python Runtime and the Node.js Runtime.

Bifrost also allows you to run JavaScript code in the Node.js Runtime.

Why use Bifrost

JavaScript is the primary language used to build DCP, and developers must write tasks in JavaScript or compile them to WebAssembly. Still, the core development team recognizes that many people use Python for pipelining or processing their data.

With this in mind, Bifrost lets developers use primitive Python variables (as well as NumPy arrays) with DCP. You as the developer must still write your programs in JavaScript if you want to distribute them, but you can use your shared Python variables without any modifications. Similarly, the variables you receive back from a program completed by DCP are in JavaScript. Bifrost converts these back into Python so that you can visualize them, post-process them, or write them to a database.

Even though you can’t (yet!) run Python code on DCP, you can shift the variables to JavaScript when there is a compute-heavy, parallelizable step.

How does one use Bifrost

To use Bifrost, import it into your project with the following code:

%%capture

# Downloads and installs a customized Node.js backend for this notebook
# This may take a few moments when first executed in a new runtime

!npm install -g n && n 10.20.1
!pip install git+https://github.com/Kings-Distributed-Systems/Bifrost

from bifrost import npm, node

When the cell finishes, the Node.js Runtime can use all the shareable variables in Python.

You can use this either in a computational notebook, by tagging a cell with the cell magic %%node, or in any Python environment, by using the following template:

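# js_code_as_string: the JavaScript source to run, as a Python string
# dictionary_mapping_of_variables_to_values: Python values to share, keyed by variable name
resulting_dictionary_of_variables = node.run(js_code_as_string, dictionary_mapping_of_variables_to_values)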
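As a minimal sketch of that template, the snippet below builds the two arguments and calls node.run only when Bifrost can be imported. The JavaScript snippet, the variable names, and the JSON check are illustrative assumptions. They are not taken from the Bifrost documentation, and the exact shape of the returned dictionary may differ.

```python
import json

# JavaScript source passed as a plain string (illustrative snippet).
js_code = "console.log('sum = ' + (a + b));"

# Python values to share, keyed by the name they should take in JavaScript.
shared_variables = {"a": 2, "b": 3}

# Assumption: primitive values cross the bridge in a JSON-like form,
# so confirm they serialize before handing them over.
json.dumps(shared_variables)

try:
    from bifrost import node  # available only after the install cell has run
except ImportError:
    node = None

if node is not None:
    resulting_variables = node.run(js_code, shared_variables)
else:
    resulting_variables = None  # Bifrost is not installed, so nothing was executed
```

Keeping the JavaScript in a string and the shared values in a dictionary mirrors what the %%node cell magic does for you inside a notebook.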

What environments does Bifrost support

At present, you can launch programs that use DCP from three locations:

  1. A local Node.js instance

  2. Vanilla Web

  3. Any Bifrost-enabled runtime (just Python, for now)

Examples of Bifrost

The best way to learn what Bifrost does is to see it in action.

Below is an example ML project that uses Python variables and converts them into JavaScript with Bifrost. It is copied from a Google Colab example that also performs other DCP operations, but the walkthrough here focuses step by step on Bifrost itself.

Using the Distributed Compute Protocol for TensorFlow

This notebook presents a small machine learning project using TensorFlow, the MNIST dataset, and the Distributed Compute Protocol (DCP). The notebook accesses an external server cluster with just a couple of commands.

Setting up your Node.js environment

The first step in using DCP is to install the appropriate version of Node.js, along with the required repositories and libraries. One can deploy computational tasks to DCP from either Node.js or the browser environment.

The interface between Python and Node.js inside the notebook is a customized tool called Bifrost. This tool allows developers to use primitive variables from Python and NumPy Arrays in their JavaScript DCP applications.

%%capture

# Downloads and installs a customized Node.js backend for this notebook
# This may take a few moments when first executed in a new runtime

!npm install -g n && n 10.20.1
!pip install git+https://github.com/Kings-Distributed-Systems/Bifrost

from bifrost import npm, node

Configuring dcp-client

With the Bifrost interface in place, you can use npm.install to install Node.js packages into the Colab environment.

In this case, you want to install DCP-Client with the following single line of code:

%%capture

# Installs DCP-Client using the notebook's Node.js backend

npm.install('dcp-client')

The next step is to connect this Colab notebook with a remote server cluster, so that it can execute in parallel across any number of nodes. The client that was just downloaded grants access to this cluster.

Using this client requires an API Key called a ‘keystore’. You should have received one alongside the URL to this Colab. If you don’t have a keystore, read Getting setup.

# Loads the ID used to deploy and pay for jobs on DCP

# When prompted, please upload the keystore file that you were provided with

from google.colab import files

KEYSTORE_NAME = list(files.upload().keys())[0]
!mkdir -p ~/.dcp && cp /content/$KEYSTORE_NAME ~/.dcp/id.keystore && cp ~/.dcp/id.keystore ~/.dcp/default.keystore
Saving colab.keystore to colab.keystore
%%node

// # Points DCP to your uploaded ID, and initializes the client

require('dcp-client').initSync();

const compute = require('dcp/compute');
const dcpCli = require('dcp/dcp-cli');

Hyperparameter tuning MNIST models over the Distributed Compute Protocol

The next step is to proceed with finding an optimal set of hyperparameters for an MNIST character-recognition model to train. This is the first compute ‘workload’.

The workload is divided into discrete units of compute called slices, each representing a neural network model with a different set of hyperparameters. DCP processes them in parallel on servers connected to this particular cluster. As the cluster computes the slices, the Colab notebook receives the completed results.
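The slice model can be pictured with a purely local analogy. The sketch below is not the DCP API: it uses a Python thread pool as a stand-in for the cluster, and the work function and its score are hypothetical placeholders for training a model. It only illustrates that each element of the input set becomes one slice and that results come back as slices finish, not necessarily in the order they were posted.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def work_function(model_parameters):
    # Stand-in for training one model; returns the slice input plus a made-up score.
    return {"parameters": model_parameters, "score": model_parameters["num_units"] / 32}

# Each element of the input set becomes one slice.
input_set = [{"num_units": units} for units in (1, 2, 4, 8, 16, 32)]

completed = []
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(work_function, params) for params in input_set]
    # Results are collected in completion order, not submission order.
    for future in as_completed(futures):
        completed.append(future.result())

print(len(completed), "of", len(input_set), "slices computed")
```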

# Declare the functions that we will use to create a population of random hyperparameter sets

import math, random

def create_generation(population_size, possible_parameters):

    def random_parameter_from_key(my_key):
        random_index = math.floor(random.random() * len(possible_parameters[my_key]))
        return possible_parameters[my_key][random_index]

    new_population = []

    for x in range(population_size):
        new_member = {}
        for key in possible_parameters:
            new_member[key] = random_parameter_from_key(key)
        new_population.append(new_member)

    return new_population
%%node

// # Define the hyperparameter tuning function that will be sent out to our workers across the DCP network

tuningFunction = `async function(modelParameters) {

  md = require('mnist');
  tf = require('tfjs');
  tf.setBackend('cpu');

  progress(0);

  // # Construct model based on set of hyperparameters provided to this worker

  let myModel = tf.sequential();
  myModel.add(tf.layers.flatten({inputShape: [28, 28, 1]}));
  for (let i = 0; i < modelParameters.num_layers; i++){
      myModel.add(tf.layers.dense({units: modelParameters.num_units, activation: modelParameters.activation}));
  }
  myModel.add(tf.layers.dense({units: 10, activation: 'softmax'}));
  myModel.compile({optimizer: tf.train[modelParameters.optimizer.toLowerCase()](modelParameters.lr), loss: 'categoricalCrossentropy', metrics: ['accuracy']});

  progress();

  // # Process MNIST character recognition data for training

  let myData = await md.load();

  //let myImages = new Float32Array(myData.images);
  //myImages = await myImages.map(x => x / 255.0);

  progress();

  let labelsTensor = await tf.tensor2d(myData.labels, [myData.labels.length / 10, 10]);
  let imagesTensor = await tf.tensor4d(myData.images, [myData.images.length / 784, 28, 28, 1]);

  progress();

  // # Train our model on MNIST data set, tracking loss and accuracy

  let myLoss;
  let myAccuracy;

  await myModel.fit(imagesTensor, labelsTensor, {
    batchSize: 100,
    epochs: 3,
    validationSplit: 0.15,
    callbacks: {onBatchEnd: async (batch, logs) => {
      progress();
    }, onEpochEnd: async (epoch, logs) => {
      myLoss = logs.val_loss;
      myAccuracy = logs.val_acc;
    }}
  });
  tf.dispose(myModel);

  progress(1.0);

  // # Return model hyperparameters, along with final loss and accuracy after training and validation

  return { parameters: modelParameters, loss: myLoss, accuracy: myAccuracy };
}`;

// # Declare the function that will deploy our hyperparameter tuning job to the DCP network

async function postJob(parameterSet, myMaxRuntime) {

    let myKeystore = await dcpCli.getAccountKeystore();

    const job = compute.for(parameterSet, tuningFunction);

    let myTimer = setTimeout(function(){
        job.cancel();
        console.log('Job reached ' + myMaxRuntime + ' minutes.');
    }, myMaxRuntime * 60 * 1000);

    job.public.name = 'DCP Colab Notebook - Hyperparameter Tuning';
    job.requires(['aistensorflow/tfjs', 'aitf-mnist-data/mnist']);

    job.on('accepted', () => {
        console.log('Job accepted: ' + job.id);
    });
    job.on('status', (status) => {
        console.log('STATUS:');
        console.log(
            status.total + ' slices posted, ' +
            status.distributed + ' slices distributed, ' +
            status.computed + ' slices computed.'
        );
    });
    job.on('result', (thisOutput) => {
        console.log('RESULT:');
        console.log(thisOutput.result);
        if (thisOutput.result.accuracy > tuning_best_result.accuracy) tuning_best_result = thisOutput.result;
    });

    try {
        await job.exec(compute.marketValue, myKeystore);
    } catch (myError) {
        console.log('Job halted.');
    }

    clearTimeout(myTimer);

    return(tuning_best_result);
}

Here, a population of random hyperparameter sets is set up. You can make the number of sets in the population higher or lower by adjusting the variable population_size; the hyperparameter space contains thousands of possible combinations. The number you enter is the number of sets that get packaged into slices and computed by nodes on the DCP network.

Furthermore, the variable tuning_max_runtime sets the maximum runtime for the hyperparameter tuning job. Some slices take longer than others, so you need to decide how long to wait for all the results. Once this number of minutes has elapsed, the job is cancelled if computation is still ongoing. All the results from the slices that workers finished before that point are still present, so the program can continue with the best-performing set of hyperparameters.
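As a quick check on those two settings, the sketch below counts the combinations in the hyperparameter space and converts the runtime limit into the milliseconds that the timeout in postJob uses. The value lists and defaults are copied from the notebook cells. The names total_combinations, sampled_fraction, and timeout_ms are introduced here for illustration only.

```python
import math

# Same value lists as the notebook's parameter_space.
parameter_space = {
    "activation": ["linear", "relu", "selu", "sigmoid", "softmax", "tanh"],
    "optimizer": ["SGD", "Adagrad", "Adadelta", "Adam", "Adamax", "RMSprop"],
    "num_layers": [1, 2, 3, 4, 5, 6],
    "num_units": [1, 2, 4, 8, 16, 32],
    "lr": [1, 0.1, 0.01, 0.001, 0.0001, 0.00001],
}

population_size = 50      # notebook default
tuning_max_runtime = 15   # minutes, notebook default

# One choice per key, so the space is the product of the list lengths.
total_combinations = math.prod(len(values) for values in parameter_space.values())

# Upper bound on coverage: random sampling may pick the same set twice.
sampled_fraction = population_size / total_combinations

# postJob passes myMaxRuntime * 60 * 1000 to setTimeout.
timeout_ms = tuning_max_runtime * 60 * 1000

print(total_combinations)          # 7776
print(f"{sampled_fraction:.2%}")   # 0.64%
print(timeout_ms)                  # 900000
```

With the defaults, the job samples well under one percent of the space, which is why raising population_size can be worthwhile when more workers are available.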

#@markdown #### Tuning Job Parameters

#@markdown Number of random hyperparameter sets for training:
population_size = 50 #@param {type:"slider", min:10, max:100, step:10}

#@markdown Maximum runtime allowed for the job, in minutes:
tuning_max_runtime = 15 #@param {type:"slider", min:5, max:30, step:5}

# Define the range of possible hyperparameters that we will be using for our models

parameter_space = {
    "activation": ['linear','relu','selu','sigmoid','softmax', 'tanh'],
    "optimizer": ['SGD','Adagrad','Adadelta','Adam','Adamax','RMSprop'],
    "num_layers": [1, 2, 3, 4, 5, 6],
    "num_units": [1, 2, 4, 8, 16, 32],
    "lr": [1, 0.1, 0.01, 0.001, 0.0001, 0.00001],
}

# Ongoing tracker of best performing set
tuning_best_result = { "accuracy": 0 }

# Generate a set of model hyperparameters
tuning_parameters = create_generation(population_size, parameter_space)
%%node

// # Call functions to generate a set of model hyperparameters, and deploy them to DCP for training in parallel

deployTime = Date.now();

postJob(tuning_parameters, tuning_max_runtime).then((value) => {
    console.log('Job complete.');

    console.log('Best accuracy found:');
    console.log(tuning_best_result.accuracy);

    let finalTime = Date.now() - deployTime;

    console.log('Total time to compute:')
    console.log((finalTime / 1000).toFixed(2) + ' seconds.')
});
Output  
Job accepted: golx5xRGNlTolpxu6eAHOk
STATUS:
50 slices posted, 0 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 1 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 2 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 3 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 4 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 5 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 6 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 7 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 8 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 9 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 10 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 11 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 12 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 13 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 14 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 15 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 16 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 17 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 18 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 19 slices distributed, 0 slices computed.
STATUS:
50 slices posted, 20 slices distributed, 0 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'RMSprop',
num_layers: 5,
num_units: 8,
lr: 0.001 },
loss: 0.3785330653190613,
accuracy: 0.8913846015930176 }
STATUS:
50 slices posted, 20 slices distributed, 1 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adam',
num_layers: 3,
num_units: 1,
lr: 1 },
loss: 2.4013476371765137,
accuracy: 0.10123077034950256 }
STATUS:
50 slices posted, 20 slices distributed, 2 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'SGD',
num_layers: 5,
num_units: 8,
lr: 0.0001 },
loss: 2.3004343509674072,
accuracy: 0.07251282036304474 }
STATUS:
50 slices posted, 20 slices distributed, 3 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'Adagrad',
num_layers: 4,
num_units: 8,
lr: 0.00001 },
loss: 2.3340208530426025,
accuracy: 0.0923076942563057 }
STATUS:
50 slices posted, 20 slices distributed, 4 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'SGD',
num_layers: 1,
num_units: 1,
lr: 0.001 },
loss: 2.2606990337371826,
accuracy: 0.19200000166893005 }
STATUS:
50 slices posted, 20 slices distributed, 5 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'SGD',
num_layers: 5,
num_units: 8,
lr: 0.0001 },
loss: 2.3452725410461426,
accuracy: 0.09856410324573517 }
STATUS:
50 slices posted, 20 slices distributed, 6 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adadelta',
num_layers: 1,
num_units: 16,
lr: 1 },
loss: 0.7750394344329834,
accuracy: 0.7326154112815857 }
STATUS:
50 slices posted, 20 slices distributed, 7 slices computed.
STATUS:
50 slices posted, 21 slices distributed, 7 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'Adagrad',
num_layers: 5,
num_units: 1,
lr: 0.00001 },
loss: 2.3025763034820557,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 21 slices distributed, 8 slices computed.
STATUS:
50 slices posted, 22 slices distributed, 8 slices computed.
STATUS:
50 slices posted, 23 slices distributed, 8 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adamax',
num_layers: 6,
num_units: 4,
lr: 0.00001 },
loss: 2.3019137382507324,
accuracy: 0.13558974862098694 }
STATUS:
50 slices posted, 23 slices distributed, 9 slices computed.
STATUS:
50 slices posted, 24 slices distributed, 9 slices computed.
STATUS:
50 slices posted, 25 slices distributed, 9 slices computed.
STATUS:
50 slices posted, 26 slices distributed, 9 slices computed.
RESULT:
{ parameters:
{ activation: 'softmax',
optimizer: 'SGD',
num_layers: 2,
num_units: 4,
lr: 0.001 },
loss: 2.303086042404175,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 26 slices distributed, 10 slices computed.
STATUS:
50 slices posted, 27 slices distributed, 10 slices computed.
STATUS:
50 slices posted, 28 slices distributed, 10 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'RMSprop',
num_layers: 1,
num_units: 1,
lr: 0.001 },
loss: 1.7201460599899292,
accuracy: 0.3432820439338684 }
STATUS:
50 slices posted, 28 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 29 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 30 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 31 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 32 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 33 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 34 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 35 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 36 slices distributed, 11 slices computed.
STATUS:
50 slices posted, 37 slices distributed, 11 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'SGD',
num_layers: 4,
num_units: 2,
lr: 0.0001 },
loss: 2.3416709899902344,
accuracy: 0.10338461399078369 }
STATUS:
50 slices posted, 37 slices distributed, 12 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'RMSprop',
num_layers: 1,
num_units: 1,
lr: 0.001 },
loss: 1.941662311553955,
accuracy: 0.21015384793281555 }
STATUS:
50 slices posted, 37 slices distributed, 13 slices computed.
STATUS:
50 slices posted, 38 slices distributed, 13 slices computed.
STATUS:
50 slices posted, 39 slices distributed, 13 slices computed.
STATUS:
50 slices posted, 40 slices distributed, 13 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'RMSprop',
num_layers: 6,
num_units: 16,
lr: 0.01 },
loss: 0.22380386292934418,
accuracy: 0.9342564344406128 }
STATUS:
50 slices posted, 40 slices distributed, 14 slices computed.
STATUS:
50 slices posted, 41 slices distributed, 14 slices computed.
STATUS:
50 slices posted, 42 slices distributed, 14 slices computed.
RESULT:
{ parameters:
{ activation: 'softmax',
optimizer: 'Adamax',
num_layers: 6,
num_units: 1,
lr: 0.01 },
loss: 2.3013417720794678,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 42 slices distributed, 15 slices computed.
STATUS:
50 slices posted, 43 slices distributed, 15 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'Adamax',
num_layers: 6,
num_units: 1,
lr: 0.001 },
loss: 1.9312543869018555,
accuracy: 0.20810256898403168 }
STATUS:
50 slices posted, 43 slices distributed, 16 slices computed.
STATUS:
50 slices posted, 44 slices distributed, 16 slices computed.
STATUS:
50 slices posted, 45 slices distributed, 16 slices computed.
STATUS:
50 slices posted, 46 slices distributed, 16 slices computed.
STATUS:
50 slices posted, 47 slices distributed, 16 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'Adagrad',
num_layers: 6,
num_units: 1,
lr: 0.01 },
loss: 1.9832453727722168,
accuracy: 0.19097435474395752 }
STATUS:
50 slices posted, 47 slices distributed, 17 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adam',
num_layers: 1,
num_units: 32,
lr: 0.001 },
loss: 0.1816585212945938,
accuracy: 0.9480000138282776 }
STATUS:
50 slices posted, 47 slices distributed, 18 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'SGD',
num_layers: 4,
num_units: 8,
lr: 0.1 },
loss: 2.3019304275512695,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 47 slices distributed, 19 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'Adamax',
num_layers: 3,
num_units: 1,
lr: 0.01 },
loss: 1.8381167650222778,
accuracy: 0.23887179791927338 }
STATUS:
50 slices posted, 48 slices distributed, 19 slices computed.
STATUS:
50 slices posted, 48 slices distributed, 20 slices computed.
STATUS:
50 slices posted, 49 slices distributed, 20 slices computed.
STATUS:
50 slices posted, 50 slices distributed, 20 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'Adadelta',
num_layers: 5,
num_units: 16,
lr: 0.0001 },
loss: 2.3923656940460205,
accuracy: 0.09600000083446503 }
STATUS:
50 slices posted, 50 slices distributed, 21 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adadelta',
num_layers: 6,
num_units: 4,
lr: 0.00001 },
loss: 2.302671432495117,
accuracy: 0.10451281815767288 }
STATUS:
50 slices posted, 50 slices distributed, 22 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'RMSprop',
num_layers: 2,
num_units: 4,
lr: 1 },
loss: 11.533044815063477,
accuracy: 0.1007179468870163 }
STATUS:
50 slices posted, 50 slices distributed, 23 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'Adagrad',
num_layers: 5,
num_units: 1,
lr: 0.0001 },
loss: 2.303934097290039,
accuracy: 0.1007179468870163 }
STATUS:
50 slices posted, 50 slices distributed, 24 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'Adam',
num_layers: 1,
num_units: 1,
lr: 0.0001 },
loss: 2.2228894233703613,
accuracy: 0.1890256404876709 }
STATUS:
50 slices posted, 50 slices distributed, 25 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'RMSprop',
num_layers: 6,
num_units: 16,
lr: 0.1 },
loss: 14.297988891601562,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 50 slices distributed, 26 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adagrad',
num_layers: 4,
num_units: 32,
lr: 1 },
loss: 2.3031649589538574,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 50 slices distributed, 27 slices computed.
RESULT:
{ parameters:
{ activation: 'softmax',
optimizer: 'Adamax',
num_layers: 3,
num_units: 1,
lr: 0.0001 },
loss: 2.328644037246704,
accuracy: 0.10246153920888901 }
STATUS:
50 slices posted, 50 slices distributed, 28 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'Adamax',
num_layers: 6,
num_units: 16,
lr: 0.1 },
loss: 0.32777613401412964,
accuracy: 0.91661536693573 }
STATUS:
50 slices posted, 50 slices distributed, 29 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'SGD',
num_layers: 1,
num_units: 4,
lr: 0.1 },
loss: 0.5345848798751831,
accuracy: 0.8333333134651184 }
STATUS:
50 slices posted, 50 slices distributed, 30 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'Adamax',
num_layers: 1,
num_units: 2,
lr: 0.0001 },
loss: 1.9837646484375,
accuracy: 0.3029743731021881 }
STATUS:
50 slices posted, 50 slices distributed, 31 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'RMSprop',
num_layers: 2,
num_units: 4,
lr: 1 },
loss: 2.6522903442382812,
accuracy: 0.09600000083446503 }
STATUS:
50 slices posted, 50 slices distributed, 32 slices computed.
RESULT:
{ parameters:
{ activation: 'softmax',
optimizer: 'RMSprop',
num_layers: 4,
num_units: 4,
lr: 1 },
loss: 3.0329971313476562,
accuracy: 0.10338461399078369 }
STATUS:
50 slices posted, 50 slices distributed, 33 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'RMSprop',
num_layers: 5,
num_units: 16,
lr: 0.0001 },
loss: 0.571458637714386,
accuracy: 0.836717963218689 }
STATUS:
50 slices posted, 50 slices distributed, 34 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'Adamax',
num_layers: 6,
num_units: 4,
lr: 0.001 },
loss: 1.318507432937622,
accuracy: 0.5468717813491821 }
STATUS:
50 slices posted, 50 slices distributed, 35 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'Adadelta',
num_layers: 6,
num_units: 16,
lr: 1 },
loss: 14.297988891601562,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 50 slices distributed, 36 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'Adadelta',
num_layers: 1,
num_units: 32,
lr: 0.01 },
loss: 1.288211464881897,
accuracy: 0.7542564272880554 }
STATUS:
50 slices posted, 50 slices distributed, 37 slices computed.
RESULT:
{ parameters:
{ activation: 'selu',
optimizer: 'SGD',
num_layers: 2,
num_units: 2,
lr: 0.01 },
loss: 1.625891089439392,
accuracy: 0.32256409525871277 }
STATUS:
50 slices posted, 50 slices distributed, 38 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adagrad',
num_layers: 5,
num_units: 32,
lr: 0.00001 },
loss: 2.3079874515533447,
accuracy: 0.07846153527498245 }
STATUS:
50 slices posted, 50 slices distributed, 39 slices computed.
RESULT:
{ parameters:
{ activation: 'softmax',
optimizer: 'Adamax',
num_layers: 6,
num_units: 4,
lr: 1 },
loss: 2.3245961666107178,
accuracy: 0.08892307430505753 }
STATUS:
50 slices posted, 50 slices distributed, 40 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'Adam',
num_layers: 4,
num_units: 16,
lr: 0.00001 },
loss: 2.000369071960449,
accuracy: 0.4795897305011749 }
STATUS:
50 slices posted, 50 slices distributed, 41 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'Adagrad',
num_layers: 5,
num_units: 8,
lr: 0.00001 },
loss: 2.3882391452789307,
accuracy: 0.08892307430505753 }
STATUS:
50 slices posted, 50 slices distributed, 42 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'Adagrad',
num_layers: 5,
num_units: 8,
lr: 0.01 },
loss: 0.4534589946269989,
accuracy: 0.8678974509239197 }
STATUS:
50 slices posted, 50 slices distributed, 43 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'RMSprop',
num_layers: 4,
num_units: 8,
lr: 0.01 },
loss: 0.3614780902862549,
accuracy: 0.9039999842643738 }
STATUS:
50 slices posted, 50 slices distributed, 44 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'Adadelta',
num_layers: 2,
num_units: 16,
lr: 0.0001 },
loss: 2.2180862426757812,
accuracy: 0.1653333306312561 }
STATUS:
50 slices posted, 50 slices distributed, 45 slices computed.
RESULT:
{ parameters:
{ activation: 'sigmoid',
optimizer: 'Adamax',
num_layers: 1,
num_units: 8,
lr: 0.001 },
loss: 0.9384765625,
accuracy: 0.8393846154212952 }
STATUS:
50 slices posted, 50 slices distributed, 46 slices computed.
RESULT:
{ parameters:
{ activation: 'linear',
optimizer: 'RMSprop',
num_layers: 2,
num_units: 32,
lr: 0.001 },
loss: 0.2835576832294464,
accuracy: 0.9228717684745789 }
STATUS:
50 slices posted, 50 slices distributed, 47 slices computed.
RESULT:
{ parameters:
{ activation: 'relu',
optimizer: 'Adamax',
num_layers: 5,
num_units: 8,
lr: 0.0001 },
loss: 2.0433709621429443,
accuracy: 0.2981538474559784 }
STATUS:
50 slices posted, 50 slices distributed, 48 slices computed.
RESULT:
{ parameters:
{ activation: 'tanh',
optimizer: 'Adagrad',
num_layers: 6,
num_units: 32,
lr: 0.00001 },
loss: 2.2852320671081543,
accuracy: 0.12092307955026627 }
STATUS:
50 slices posted, 50 slices distributed, 49 slices computed.
RESULT:
{ parameters:
{ activation: 'softmax',
optimizer: 'RMSprop',
num_layers: 6,
num_units: 32,
lr: 0.0001 },
loss: 2.301220417022705,
accuracy: 0.11292307823896408 }
STATUS:
50 slices posted, 50 slices distributed, 50 slices computed.
Job complete.
Best accuracy found:
0.9480000138282776
Total time to compute:
595.79 seconds.

Describing the console output of hyperparameter searches

What you see are discrete units of ‘compute’ being distributed by DCP (the data and methods are transmitted to worker nodes) and then computed. In this case, each unit is one hyperparameter search. As the results come back, they’re displayed in the console.
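The 'result' handler in postJob keeps only the most accurate result seen so far. The same rule is sketched below in Python, applied to three of the results shown in the console output above. The list name results is introduced here for illustration, and in the notebook the comparison runs in JavaScript as each slice returns.

```python
# Three results copied from the console output above.
results = [
    {"parameters": {"activation": "selu", "optimizer": "RMSprop",
                    "num_layers": 5, "num_units": 8, "lr": 0.001},
     "loss": 0.3785330653190613, "accuracy": 0.8913846015930176},
    {"parameters": {"activation": "relu", "optimizer": "Adam",
                    "num_layers": 3, "num_units": 1, "lr": 1},
     "loss": 2.4013476371765137, "accuracy": 0.10123077034950256},
    {"parameters": {"activation": "relu", "optimizer": "Adam",
                    "num_layers": 1, "num_units": 32, "lr": 0.001},
     "loss": 0.1816585212945938, "accuracy": 0.9480000138282776},
]

# Start from the same sentinel the notebook uses.
tuning_best_result = {"accuracy": 0}

for result in results:
    # Replace the tracked result only when a strictly better accuracy arrives.
    if result["accuracy"] > tuning_best_result["accuracy"]:
        tuning_best_result = result

print(tuning_best_result["parameters"])
```

Because the tracker is updated as each result arrives, a job that is cancelled at the runtime limit still leaves a usable best result behind.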

Build a TensorFlow model from the best-performing hyperparameters

Now, load the MNIST dataset and train the model with TensorFlow. This training uses the most accurate hyperparameter set found in the previous step.

# Load the MNIST character recognition dataset in the notebook

import tensorflow as tf

(_x_train, _y_train),(_x_test, _y_test) = tf.keras.datasets.mnist.load_data()

# Normalize image pixel data between 0 and 1, and convert labels to one-hot format

_x_train = _x_train / 255.0
_y_train = tf.keras.utils.to_categorical(_y_train, num_classes = 10)

# Construct and train model in notebook, using best-performing hyperparameters from above

_optimizer_name = getattr(tf.keras.optimizers, tuning_best_result['parameters']['optimizer'])
_model_optimizer = _optimizer_name(learning_rate = tuning_best_result['parameters']['lr'])

_model = tf.keras.models.Sequential()

_model.add(tf.keras.layers.Flatten(input_shape = (28,28,1)))

for x in range(tuning_best_result['parameters']['num_layers']):
  _model.add(tf.keras.layers.Dense(tuning_best_result['parameters']['num_units'], activation = tuning_best_result['parameters']['activation']))

_model.add(tf.keras.layers.Dense(10, activation = 'softmax'))

_model.compile(
    optimizer = _model_optimizer,
    loss = 'categorical_crossentropy',
    metrics= ['accuracy'])

_model.fit(_x_train, _y_train, epochs = 3, validation_split = 0.15, batch_size = 100)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
Epoch 1/3
510/510 [==============================] - 1s 3ms/step - loss: 0.5003 - accuracy: 0.8644 - val_loss: 0.2389 - val_accuracy: 0.9364
Epoch 2/3
510/510 [==============================] - 1s 2ms/step - loss: 0.2431 - accuracy: 0.9306 - val_loss: 0.1895 - val_accuracy: 0.9494
Epoch 3/3
510/510 [==============================] - 1s 2ms/step - loss: 0.2000 - accuracy: 0.9440 - val_loss: 0.1650 - val_accuracy: 0.9547
<tensorflow.python.keras.callbacks.History at 0x7fa7cdaa0dd8>

Saving your model with TensorFlow.js Converter

Now, convert the locally trained model to the TensorFlow.js format. This is how the model trained in Python can perform inferencing in parallel on the DCP network.

%%capture

# Download and install model conversion tool

!pip install tensorflowjs==2.1.0
!git clone https://github.com/Kings-Distributed-Systems/tfjs_util.git
!cd tfjs_util && npm i && npm run postinstall
# Convert trained model from Python to JavaScript

import tensorflowjs as tfjs
import random, string

tfjs.converters.save_keras_model(_model, './tfjs_model')

# Upload saved javascript model to the DCP network for inferencing

MODULE_NAME = 'colab-' + ''.join(random.choice(string.ascii_lowercase) for i in range(25))

!node /content/tfjs_util/bin/serializeModel.js -m ./tfjs_model/model.json -o $MODULE_NAME/model.js -p 0.0.1 -d
/usr/local/lib/python3.6/dist-packages/tensorflowjs/converters/keras_h5_conversion.py:123: H5pyDeprecationWarning: The default file mode will change to 'r' (read-only) in h5py 3.0. To suppress this warning, pass the mode you need to h5py.File(), or set the global default h5.get_config().default_file_mode, or set the environment variable H5PY_DEFAULT_READONLY=1. Available modes are: 'r', 'r+', 'w', 'w-'/'x', 'a'. See the docs for details.
  return h5py.File(h5file)

2020-11-12 18:27:48.582011: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-11-12 18:27:48.596924: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2020-11-12 18:27:48.597180: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x44dee00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-12 18:27:48.597218: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Module published at :  colab-heauiezjsqvoagbvvstcldwao/model.js
Done!

The last step is to use the model for inferencing with DCP.

As before, the DCP network lets the program tap into an external server cluster. This lets you scale up the inferencing volume without paying a significant sum or locking the program into a cloud operator such as AWS.
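The worker function in the following cell receives each slice as one comma-separated string of pixel values, scales the values to the range 0 to 1, reshapes them into 28 x 28 images, and takes the arg-max of the model output as the predicted digit. The Python sketch below mirrors those data-handling steps without TensorFlow. The helpers parse_slice and argmax and the synthetic pixel data are hypothetical, and how the notebook actually builds its slice strings is not shown in this section.

```python
PIXELS_PER_IMAGE = 28 * 28  # 784, matching the worker's reshape

def parse_slice(slice_text):
    """Split one slice's comma-separated pixels and scale them to [0, 1]."""
    pixels = [int(value) / 255.0 for value in slice_text.split(",")]
    if len(pixels) % PIXELS_PER_IMAGE != 0:
        raise ValueError("slice does not contain whole 28x28 images")
    image_count = len(pixels) // PIXELS_PER_IMAGE
    return pixels, image_count

def argmax(scores):
    """Index of the largest score, i.e. the predicted digit."""
    return max(range(len(scores)), key=lambda index: scores[index])

# Two synthetic images: one all black (0) and one all white (255).
slice_text = ",".join(["0"] * PIXELS_PER_IMAGE + ["255"] * PIXELS_PER_IMAGE)
pixels, image_count = parse_slice(slice_text)

print(image_count)                      # number of images in this slice
print(argmax([0.05, 0.1, 0.7, 0.15]))   # index of the highest score
```

Packing several images into one slice lets a single worker load the model once and predict a whole batch, which is what the division by 784 in the worker function accounts for.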

%%node

// # Define the inferencing function that will be sent out to our workers across the DCP network

inferenceFunction = `async function(myData) {

  progress(0);

  tf = require('tfjs');
  tf.setBackend('cpu');

  // # Load our saved model into the worker

  let myModel = await require('model').getModel();

  progress();

  // # Convert testing data to an array and normalize between 0 and 1

  myData = await myData.split(',');
  myData = await myData.map(x => x / 255.0);

  progress();

  // # Convert normalized testing data to a tensor for prediction

  let imagesTensor = tf.tensor4d(myData, [myData.length / 784, 28, 28, 1]);

  progress();

  // # Run saved model on the testing data tensor

  let predictResults = tf.tidy(() => {

    const output = myModel.predict(imagesTensor);

    const axis = 1;
    const myPredictions = Array.from(output.argMax(axis).dataSync());

    return {predictions: myPredictions};
  });

  // # Release tensorflow memory, report work completion for this slice and return the predictions to the client notebook

  tf.dispose(myModel);

  progress(1.0);

  return predictResults;
}`;

// # Declare the function that will deploy our inferencing job to the DCP network

async function inferenceJob(inferenceData, inferenceLabels, myMaxRuntime) {

    let myKeystore = await dcpCli.getAccountKeystore();

    const job = compute.for(inferenceData, inferenceFunction);

    let myTimer = setTimeout(function(){
        job.cancel();
        console.log('Job reached ' + myMaxRuntime + ' minutes.');
    }, myMaxRuntime * 60 * 1000);

    job.public.name = 'DCP Colab Notebook - Saved Models';
    job.requires([`${MODULE_NAME}/model`, 'aistensorflow/tfjs']);

    job.on('accepted', () => {
        console.log('Job accepted: ' + job.id);
    });
    job.on('status', (status) => {
        console.log('STATUS:');
        console.log(
            status.total + ' slices posted, ' +
            status.distributed + ' slices distributed, ' +
            status.computed + ' slices computed.'
        );
    });
    job.on('result', (thisOutput) => {

        let sliceIndex = thisOutput.sliceNumber;
        let myPredictions = thisOutput.result.predictions;

        let correctCount = 0;
        for (let i = 0; i < myPredictions.length; i++) {
            if (myPredictions[i] == inferenceLabels[sliceIndex][i]) correctCount++;
        }
        console.log('RESULT:');
        console.log(correctCount + ' / ' + myPredictions.length + ' ( ' + ( correctCount / myPredictions.length * 100).toFixed(2) + '% )');

    });

    try {
        await job.exec(compute.marketValue, myKeystore);
    } catch (myError) {
        console.log('Job halted.');
    }

    clearTimeout(myTimer);

    return('\nJob complete.\n');
}
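
The work function above parses a comma-separated pixel string, scales each value into the range 0 to 1, and takes the arg-max of each row of model scores to get a predicted digit. The sketch below reproduces those two data steps in plain JavaScript, without TensorFlow.js, so they can be checked locally. The helper names `parsePixels` and `argMaxRows` are hypothetical and are not part of DCP, Bifrost, or TensorFlow.js.

```javascript
// Hypothetical helper: turn "0,255,51" into normalized numbers in [0, 1].
function parsePixels(csv) {
  return csv.split(',').map(x => Number(x) / 255.0);
}

// Hypothetical helper: for a flat array of scores laid out row by row,
// return the index of the largest score in each row (like argMax over axis 1).
function argMaxRows(scores, classCount) {
  const predictions = [];
  for (let row = 0; row < scores.length; row += classCount) {
    let best = 0;
    for (let c = 1; c < classCount; c++) {
      if (scores[row + c] > scores[row + best]) best = c;
    }
    predictions.push(best);
  }
  return predictions;
}

console.log(parsePixels('0,255,51'));
console.log(argMaxRows([0.1, 0.7, 0.2, 0.5, 0.3, 0.2], 3));
```

With two rows of three scores each, the second call picks index 1 for the first row and index 0 for the second.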

#@markdown # Inference Job Parameters

#@markdown Desired number of parallel workers; the testing data will be divided into this many batches:
slice_count = 10 #@param {type:"slider", min:10, max:100, step:10}

#@markdown Maximum runtime allowed for the job, in minutes:
inference_max_runtime = 5 #@param {type:"slider", min:5, max:30, step:5}

# Make MNIST character recognition testing data loaded earlier available to the Node.js context
xTest = _x_test
yTest = _y_test

%%node

// # Arrange testing data in batches of the number of images to be distributed to each worker

xSize = xTest.typedArray.length / slice_count;
ySize = yTest.typedArray.length / slice_count;

testingImages = [];
testingLabels = [];

for (let i = 0; i < slice_count; i++) {
    testingImages.push(xTest.typedArray.slice(i * xSize, (i + 1) * xSize).toString());
    testingLabels.push(yTest.typedArray.slice(i * ySize, (i + 1) * ySize));
}
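
The batching cell above divides the array length by `slice_count`, which works cleanly when the slice count divides the number of images evenly (as it does for 10 slices in the run shown here). For other slider values the batch size becomes fractional, and a batch edge could fall in the middle of an image. A minimal sketch of one way to keep batches aligned to whole images follows; `batchByImage` is a hypothetical helper, and the toy data stands in for the real pixel array.

```javascript
// Hypothetical helper: split flat pixel data into batches that never cut an image in half.
function batchByImage(pixels, pixelsPerImage, batchCount) {
  const imageCount = Math.floor(pixels.length / pixelsPerImage);
  // Round up so every image is assigned to some batch.
  const imagesPerBatch = Math.ceil(imageCount / batchCount);
  const batches = [];
  for (let start = 0; start < imageCount; start += imagesPerBatch) {
    const end = Math.min(start + imagesPerBatch, imageCount);
    batches.push(pixels.slice(start * pixelsPerImage, end * pixelsPerImage));
  }
  return batches;
}

// Toy data: 10 "images" of 4 pixels each, split into at most 3 batches.
const toyPixels = Uint8Array.from({ length: 40 }, (_, i) => i);
const toyBatches = batchByImage(toyPixels, 4, 3);
console.log(toyBatches.map(b => b.length / 4)); // images per batch: [ 4, 4, 2 ]
```

Because the batch size is rounded up, this approach can produce fewer batches than requested, and the last batch may be smaller than the others. The labels would need to be split with the same image boundaries so that each batch of predictions lines up with its labels.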

%%node

// # Calls functions to deploy the saved model and testing data to the DCP network for inferencing in parallel

inferenceJob(testingImages, testingLabels, inference_max_runtime).then((value) => {
    console.log(value);
});
Job accepted: el5Us9nrlrVrnw54nyN97R
STATUS:
10 slices posted, 0 slices distributed, 0 slices computed.
STATUS:
10 slices posted, 1 slices distributed, 0 slices computed.
STATUS:
10 slices posted, 2 slices distributed, 0 slices computed.
RESULT:
947 / 1000 ( 94.70% )
STATUS:
10 slices posted, 2 slices distributed, 1 slices computed.
RESULT:
921 / 1000 ( 92.10% )
STATUS:
10 slices posted, 2 slices distributed, 2 slices computed.
STATUS:
10 slices posted, 3 slices distributed, 2 slices computed.
RESULT:
940 / 1000 ( 94.00% )
STATUS:
10 slices posted, 3 slices distributed, 3 slices computed.
STATUS:
10 slices posted, 4 slices distributed, 3 slices computed.
STATUS:
10 slices posted, 5 slices distributed, 3 slices computed.
STATUS:
10 slices posted, 6 slices distributed, 3 slices computed.
STATUS:
10 slices posted, 7 slices distributed, 3 slices computed.
STATUS:
10 slices posted, 8 slices distributed, 3 slices computed.
RESULT:
963 / 1000 ( 96.30% )
STATUS:
10 slices posted, 8 slices distributed, 4 slices computed.
RESULT:
936 / 1000 ( 93.60% )
STATUS:
10 slices posted, 8 slices distributed, 5 slices computed.
RESULT:
977 / 1000 ( 97.70% )
STATUS:
10 slices posted, 8 slices distributed, 6 slices computed.
RESULT:
968 / 1000 ( 96.80% )
STATUS:
10 slices posted, 8 slices distributed, 7 slices computed.
STATUS:
10 slices posted, 9 slices distributed, 7 slices computed.
STATUS:
10 slices posted, 10 slices distributed, 7 slices computed.
RESULT:
922 / 1000 ( 92.20% )
STATUS:
10 slices posted, 10 slices distributed, 8 slices computed.
RESULT:
950 / 1000 ( 95.00% )
STATUS:
10 slices posted, 10 slices distributed, 9 slices computed.
RESULT:
976 / 1000 ( 97.60% )
STATUS:
10 slices posted, 10 slices distributed, 10 slices computed.
Job complete.

Describing the console output of inferencing examples

The console output shows the discrete units of compute, called slices, that DCP distributes (data and methods transmitted to worker nodes) and computes. In this case, each slice is a batch of inferencing examples. As the results come back, the console displays them.
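
Each RESULT line reports accuracy for a single slice, so the overall figure has to be summed across slices. The sketch below does that with the ten counts copied from the console output above; `overallAccuracy` is a hypothetical helper, not part of DCP or Bifrost.

```javascript
// Per-slice correct counts copied from the RESULT lines in the console output.
// Results arrive in completion order, which need not match slice order.
const correctPerSlice = [947, 921, 940, 963, 936, 977, 968, 922, 950, 976];
const imagesPerSlice = 1000; // each RESULT line reports "n / 1000"

// Hypothetical helper: combine per-slice counts into one accuracy figure.
function overallAccuracy(correctCounts, sliceSize) {
  const totalCorrect = correctCounts.reduce((sum, n) => sum + n, 0);
  const totalImages = correctCounts.length * sliceSize;
  return { totalCorrect, totalImages, percent: (totalCorrect / totalImages) * 100 };
}

const summary = overallAccuracy(correctPerSlice, imagesPerSlice);
console.log(
  summary.totalCorrect + ' / ' + summary.totalImages +
  ' ( ' + summary.percent.toFixed(2) + '% )'
);
// prints "9500 / 10000 ( 95.00% )"
```

In the notebook itself, the same sum could be accumulated inside the `result` event handler instead of being copied from the log.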

This is just one example of what DCP can do. The same basic steps can help speed up work built with other AI/ML frameworks, such as PyTorch and Keras, as well as non-AI applications. As in this example, there is no need to manually provision or orchestrate compute resources.

Distributed, cloud-based servers make up the cluster that powers this example. At scale, the cost of these is ~80% less than the alternative from a public cloud like AWS or Microsoft Azure. DCP also lets developers build with private, internal networks made of underutilized machines. Private networks result in cost savings of 95% or more.

To learn more and get in contact with the developers, please email info@kingsds.network.

Future extensions on Bifrost

At the moment, Bifrost supports JavaScript <-> Python syncing of basic Python variables and NumPy arrays. As of March 2021, the core development team plans to add more features, which may include support for other programming languages.