The open-source distributed AI inference framework exo has gained 2.5k stars on GitHub. It lets users build their own AI computing cluster out of everyday devices such as iPhones and iPads in just minutes.
Unlike distributed inference frameworks built on a master-worker architecture, exo connects devices peer-to-peer (p2p): any device that joins the local network is automatically added to the cluster.
One developer used exo to connect two MacBook Pros and a Mac Studio, reaching an aggregate compute of 110 TFLOPS, and says the setup is ready for the upcoming Llama3-405B model, for which the exo team has promised day-0 support.
Exo can fold not only computers but also iPhones, iPads, and even Apple Watches into the local computing network. As the framework evolves it is no longer Apple-exclusive: some users have added Android phones and NVIDIA RTX 4090 GPUs to their clusters.
Setup takes as little as 60 seconds. Because there is no master node, every device is an equal peer: devices on the same local network discover one another and join the computing network automatically.
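As a rough illustration of how this kind of zero-configuration discovery can work, the sketch below broadcasts a node's presence over UDP and collects announcements from peers on the same LAN. The port number, message format, and function names here are hypothetical; this shows the general technique, not exo's actual code.

```python
# Hypothetical sketch of LAN peer discovery via UDP broadcast.
# Not exo's implementation; port and message schema are made up.
import json
import socket
import threading
import time
import uuid

DISCOVERY_PORT = 50505       # hypothetical discovery port
NODE_ID = str(uuid.uuid4())  # unique id so a node can ignore its own packets

def announce_loop():
    """Periodically broadcast this node's presence to the local network."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    payload = json.dumps({"node_id": NODE_ID, "rpc_port": 50051}).encode()
    while True:
        sock.sendto(payload, ("255.255.255.255", DISCOVERY_PORT))
        time.sleep(2.0)

def listen_loop(peers):
    """Collect announcements from other nodes on the same LAN."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", DISCOVERY_PORT))
    while True:
        data, (addr, _port) = sock.recvfrom(4096)
        msg = json.loads(data)
        if msg["node_id"] != NODE_ID and msg["node_id"] not in peers:
            peers[msg["node_id"]] = (addr, msg["rpc_port"])
            print(f"discovered peer {msg['node_id'][:8]} at {addr}")

if __name__ == "__main__":
    threading.Thread(target=announce_loop, daemon=True).start()
    listen_loop({})  # run the listener in the foreground
```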
Exo supports multiple strategies for partitioning a model across devices. The default, ring memory-weighted partitioning, arranges the devices in a ring and gives each one a contiguous slice of the model's layers proportional to its available memory. Manual configuration is minimal, and Bluetooth connections are planned as an alternative to the local network.
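A minimal sketch of the memory-weighted idea, assuming each device's share of layers is proportional to its memory (the class, function, and numbers below are illustrative, not exo's implementation):

```python
# Illustrative memory-weighted ring partitioning, not exo's actual code.
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    memory_gb: float  # available memory, used as the partition weight

def partition_layers(devices: list[Device], num_layers: int) -> dict[str, range]:
    """Assign each device a contiguous block of layers, sized by memory.

    Devices form a logical ring: layer 0 starts on the first device,
    activations flow device-to-device through the last layer, then wrap
    back to the start for the next token.
    """
    total_mem = sum(d.memory_gb for d in devices)
    assignment, start = {}, 0
    for i, d in enumerate(devices):
        if i == len(devices) - 1:
            end = num_layers  # last device absorbs rounding leftovers
        else:
            end = start + round(num_layers * d.memory_gb / total_mem)
        assignment[d.name] = range(start, end)
        start = end
    return assignment

# Example: an 80-layer model split across three heterogeneous Macs.
cluster = [Device("mac-studio", 192), Device("mbp-1", 128), Device("mbp-2", 64)]
print(partition_layers(cluster, num_layers=80))
# -> {'mac-studio': range(0, 40), 'mbp-1': range(40, 67), 'mbp-2': range(67, 80)}
```

Weighting by memory keeps the largest contiguous slices on the devices best able to hold them, which matters because the weights themselves dominate memory use during inference.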
The framework ships with a graphical interface, tinychat, and exposes an OpenAI-compatible API. Exo currently supports Apple's MLX and the open-source machine learning framework tinygrad, with llama.cpp support in progress.
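Because the API mirrors OpenAI's chat-completions format, existing clients can point at the cluster with only a URL change. A hedged example, assuming the default port and a model name from the project docs at the time of writing (both may differ in your setup):

```python
# Calling exo's OpenAI-compatible endpoint. Port 8000 and the model name
# are placeholders taken from the project docs; check your own setup.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llama-3-70b",  # placeholder; use a model your cluster serves
        "messages": [{"role": "user", "content": "What is distributed inference?"}],
        "temperature": 0.7,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```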
Because the iOS implementation has lagged behind the Python version, the team has temporarily pulled the iPhone and iPad builds.
Running large models locally has advantages in privacy protection, offline access, and personalized customization, and some argue that a cluster built from devices people already own costs less in the long run than cloud services.
Skeptics, however, question whether older consumer devices can match the compute of professional service providers, and note that the high-end hardware used in the demonstrations is itself expensive.
The framework's author has clarified that exo transmits small activation vectors between devices rather than entire model weights, which keeps the impact of network latency on performance small.
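A back-of-envelope calculation shows why. Assuming a 70B-parameter, Llama-class model with hidden size 8192 in fp16 (illustrative figures, not exo measurements), the activation handed from one device to the next is about 16 KiB per token, while the weights, which never leave each device, total roughly 140 GB:

```python
# Why shipping activations is cheap: illustrative numbers for a
# 70B-parameter Llama-style model (hidden size 8192, fp16 = 2 bytes).
HIDDEN_SIZE = 8192      # Llama-3-70B hidden dimension
BYTES_PER_VALUE = 2     # fp16
PARAMS = 70e9

activation_bytes = HIDDEN_SIZE * BYTES_PER_VALUE  # per token, per hop
weight_bytes = PARAMS * BYTES_PER_VALUE

print(f"activation per token per hop: {activation_bytes / 1024:.0f} KiB")  # ~16 KiB
print(f"full model weights:           {weight_bytes / 1e9:.0f} GB")        # ~140 GB
```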
While still experimental, exo aims to eventually be as simple to use as Dropbox. The team has also published a list of current limitations it plans to address, offering bounties of $100 to $500 for solutions.