As machine learning gains ever more ground, companies are adapting so they can make effective use of the opportunities it offers. In-house data scientists are usually given a Jupyter notebook backed by a GPU instance in the cloud, while a separate team manages deployment and serving. But as the complexity of applications and the number of deployments grow, this arrangement breaks down. As a result, many companies are constantly in search of machine learning platforms.
While companies like Apple and Uber build their own platforms, other companies look to startups for end-to-end machine learning platforms. Even so, many companies still face unsolved challenges. The way forward is for companies to look for ML platforms that are well positioned for modern AI applications. How do entrepreneurs who need machine learning to speed up their work identify such platforms? An article on the website Dataversity looks at the key features of machine learning platforms.
They are outlined below:
Developers and machine learning engineers use a variety of tools and programming languages (R, Python, Julia, SAS, etc.). But with the rise of deep learning, Python has become the dominant programming language for machine learning. So if anything, an ML platform needs to support Python and the Python ecosystem.
As a practical matter, developers and machine learning engineers rely on many different Python libraries and tools. The most widely used libraries include deep learning tools (TensorFlow, PyTorch), machine learning and statistical modeling libraries (scikit-learn, statsmodels), NLP tools (spaCy, Hugging Face, AllenNLP), and model tuning (Hyperopt, Tune).
Because it integrates seamlessly in the Python ecosystem, many developers are leveraging Ray for building machine learning tools. It is a general-purpose distributed computing platform that can be used to easily scale existing Python libraries and applications. It also has a growing collection of standalone libraries available to Python developers.
As we noted in The Future of Computing is Distributed, the “demands of machine learning applications are increasing at a breakneck speed.” The rise of deep learning and new workloads means that distributed computing will be common for machine learning. Unfortunately, many developers have relatively little experience in distributed computing.
Scaling and distributed computation are areas where Ray has been helpful to many users we’ve spoken with. It allows developers to focus on their applications instead of on the intricacies of distributed computing. Using it brings several benefits to developers needing to scale machine learning applications:
- It’s a platform that lets you easily scale your existing applications to a cluster. This could be as simple as scaling your favorite library to a compute cluster (see this recent post on using it to scale scikit-learn). Or it could involve using the API to scale an existing program to a cluster. The latter scenario is one we’ve seen happen for applications in NLP, online learning, fraud detection, financial time-series, OCR, and many other use cases.
- RaySGD simplifies distributed training for PyTorch and TensorFlow. This is good news for the many companies and developers who struggle with training or tuning large neural networks.
- A built-in cluster launcher makes it simple to set up a cluster, so you spend less time on DevOps.
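Scaling an existing program, the second scenario in the list above, usually comes down to a fan-out and gather pattern: submit independent units of work, then collect the results. The sketch below shows that shape on a single machine using only Python's standard library. It is not Ray's API and does not run on a cluster; `score_document` and `score_all` are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def score_document(text):
    """Hypothetical per-item workload: count the words in one document."""
    return len(text.split())


def score_all(documents, max_workers=4):
    """Fan the documents out to a pool of workers and gather results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map submits one task per document and yields results
        # in the same order as the inputs.
        return list(pool.map(score_document, documents))


documents = [
    "distributed computing will be common",
    "scale existing Python libraries",
    "hyperparameter tuning can be very expensive",
]
word_counts = score_all(documents)
print(word_counts)
```

Ray's core API follows the same submit-and-gather shape (functions decorated with `ray.remote`, results collected with `ray.get`), with the work scheduled across the machines in a cluster rather than across local threads.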
Extensibility for New Workloads
Modern AI platforms are notoriously compute hungry. In the post The Future of Computing is Distributed, referenced above, we noted that model tuning is an important part of the machine learning development process:
“You don’t train a model just once. Typically, the quality of a model depends on a variety of hyperparameters, such as the number of layers, the number of hidden units, and the batch size. Finding the best model often requires searching among various hyperparameter settings. This process is called hyperparameter tuning, and it can be very expensive.” Ion Stoica
Developers can choose from several libraries for tuning models. One of the more popular tools is Tune, a scalable hyperparameter tuning library built on top of Ray. Tune runs on a single node or on a cluster and has quickly become one of the more popular libraries in this ecosystem.
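To make the cost of tuning concrete, here is a minimal sketch of an exhaustive grid search over the three hyperparameters named in the quote. It uses only the standard library and is not Tune's API. The search space values and the `toy_validation_loss` function are assumptions made up for illustration; the function stands in for a real training run.

```python
import itertools

# Hypothetical search space, using the hyperparameters named in the quote.
SEARCH_SPACE = {
    "num_layers": [1, 2, 3],
    "hidden_units": [32, 64, 128],
    "batch_size": [16, 32, 64],
}


def toy_validation_loss(config):
    """Stand-in for training a model and returning its validation loss.

    A made-up function whose minimum is at
    num_layers=2, hidden_units=64, batch_size=32.
    """
    return (
        (config["num_layers"] - 2) ** 2
        + ((config["hidden_units"] - 64) / 32) ** 2
        + ((config["batch_size"] - 32) / 16) ** 2
    )


def grid_search(space, objective):
    """Evaluate every combination and return (best_config, best_loss, n_trials)."""
    names = list(space)
    best_config, best_loss, n_trials = None, float("inf"), 0
    for values in itertools.product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        loss = objective(config)
        n_trials += 1
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss, n_trials


best_config, best_loss, n_trials = grid_search(SEARCH_SPACE, toy_validation_loss)
print(f"Ran {n_trials} trials; best config: {best_config} (loss {best_loss:.2f})")
```

Three values for each of three hyperparameters already means 27 trials. The toy objective makes each trial instant, but when every trial is a full training run, the trial count multiplies with each added hyperparameter, which is why running trials in parallel across a cluster matters.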
Reinforcement learning (RL) is another area worth highlighting. Many of the recent articles about RL pertain to gameplay (Atari, Go, multiplayer video games) or to applications in industrial settings (e.g., data center efficiency). But, as we’ve previously noted, there are emerging applications in recommendations and personalization, simulation and optimization, financial time series, and public policy.
RL is compute-intensive and complex to implement and scale, so many developers will simply want to use libraries. Ray provides a simple, highly scalable library (RLlib) that developers and machine learning engineers across several organizations are already using in production.
Tools Designed for Teams
As companies begin to use and deploy more machine learning models, teams of developers will need to be able to collaborate with each other. They will need access to platforms that enable both sharing and discovery. When considering an ML platform, consider the key stages of model development and operations, and assume that teams of people with different backgrounds will collaborate during each of those phases.
For example, feature stores (first introduced by Uber in 2017) are useful because they allow developers to share and discover features that they might otherwise not have thought about. Teams also need to be able to collaborate during the model development lifecycle. This includes managing, tracking, and reproducing experiments. The leading open source project in this area is MLflow, but we’ve come across users of other tools like Weights & Biases and Comet, as well as users who have built their own tools to manage ML experiments.
Enterprises will require additional features, including security and access control. Model governance and model catalogs (analogs of similar systems for managing data) are also going to be required as teams of developers build and deploy more models.
First Class MLOps
MLOps is a set of practices focused on productionizing the machine learning lifecycle. It is a relatively new term that draws ideas from continuous integration (CI) and continuous deployment (CD), two widely used software development practices. A recent Thoughtworks post listed some key considerations for establishing CD for machine learning (CD4ML). Some key items for CI/CD for machine learning include reproducibility, experiment management and tracking, model monitoring and observability, and more. There are a few startups and open source projects that offer MLOps solutions, including Datatron, Verta, TFX, and MLflow.
Ray has components that would be useful for companies moving towards CI/CD or building CI/CD tools for machine learning. It already has libraries for key stages of the ML lifecycle: training (RaySGD), tuning (Tune), and deployment (Serve). Having access to libraries that work seamlessly together will allow users to more readily bring CI/CD methods into their MLOps practice.