```python
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Use streaming generation

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, max_new_tokens=50, streamer=streamer)
```
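If you need the generated text programmatically rather than printed to stdout, a minimal sketch using the standard `transformers` `TextIteratorStreamer`, which runs `generate()` in a background thread while the main thread consumes decoded chunks:

```python
from threading import Thread
from transformers import TextIteratorStreamer

# Iterator-style streaming: generation runs in a background thread,
# the main thread consumes decoded text chunks as they arrive.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
thread = Thread(target=model.generate,
                kwargs=dict(**inputs, max_new_tokens=50, streamer=streamer))
thread.start()
for text in streamer:
    print(text, end="", flush=True)
thread.join()
```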
Use sequential prefill

```python
model = AutoModelForCausalLM.from_pretrained(model_id, use_sequential_prefill=True)
```
Handle long sequences

```python
long_prompt = "..."  # very long prompt
inputs = tokenizer(long_prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Falcon Mamba 7B will be released under the TII Falcon License 2.0, which is an Apache 2.0-based license that includes a usage policy promoting responsible use of AI.
The model removes the sequence-length limitations of attention-based architectures without sacrificing performance. Because its state size is fixed, it can process sequences of arbitrary length with constant memory usage, and in particular it runs on a single A10 24GB GPU. The time needed to generate each new token is constant regardless of context size.
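The constant-memory, constant-latency behavior can be sanity-checked empirically. Below is an illustrative sketch, assuming the `model` and `tokenizer` from the examples above are loaded on a CUDA GPU; the context sizes are arbitrary, and the 1-token run is subtracted to factor prefill cost out of the decode-time estimate:

```python
import time
import torch

def time_generate(inputs, n_tokens):
    """Wall-clock time for one generate() call (includes prefill)."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=n_tokens, min_new_tokens=n_tokens)
    torch.cuda.synchronize()
    return time.perf_counter() - start

for n_words in (1_000, 10_000, 100_000):  # arbitrary context sizes
    prompt = "word " * n_words
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.reset_peak_memory_stats()
    # Per-token decode latency: difference between a 51-token and a
    # 1-token run, divided by the 50 extra tokens.
    decode_time = (time_generate(inputs, 51) - time_generate(inputs, 1)) / 50
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"{n_words:>7} words: {decode_time:.3f} s/token, peak {peak_gb:.2f} GB")
```

If the claims hold, the per-token time and peak memory reported should stay roughly flat as the context grows.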
On multiple benchmarks, Falcon Mamba 7B outperforms leading models in its size class, such as Llama 3.1 8B and Mistral 7B. It is the first general-purpose, large-scale Mamba model shown to handle a wide range of text generation tasks.
This open-source model gives researchers and developers an opportunity to explore and build on the potential of the SSLM (state space language model) architecture, which promises advances in handling long text sequences and improving generation efficiency.