AI News 17 Jan 2025

UMbreLLa runs Llama3.3-70B on consumer GPUs, reaching up to 9.7 tokens/sec on an RTX 4070 Ti and up to 11.4 tokens/sec on an RTX 4090. It combines parameter offloading, speculative decoding, and 4-bit quantization (AWQ Q4), which makes inference with a 70B-parameter model practical on affordable hardware, especially for coding tasks.
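
To make the speculative decoding idea concrete, here is a minimal sketch of its greedy variant in Python. It does not show how UMbreLLa implements it, which the summary above does not describe. The names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, and the two "models" are toy functions standing in for a large target model and a small draft model.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int = 4) -> List[int]:
    """Run one draft-then-verify round and return the accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks every proposed position. In a real system
    #    this is a single batched forward pass, which is where the speedup
    #    comes from; here it is a plain loop for clarity.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. All k proposals matched, so the target adds one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Generate max_new tokens using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target except after a 5."""
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


def greedy_reference(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain one-token-at-a-time greedy decoding, used for comparison."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out
```

In this greedy form the output is identical to decoding with the target model alone; the draft model only changes how many target calls are needed. Systems that sample rather than decode greedily use a probabilistic acceptance rule instead of the exact-match check shown here.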

MCP (Model Context Protocol) introduced dynamic tool discovery: clients can list the tools a server currently offers and receive real-time notifications when that list changes.
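
A rough sketch of how this discovery flow can look on the wire, written in Python with no SDK. MCP messages are JSON-RPC 2.0; the method name `tools/list` and the notification `notifications/tools/list_changed` follow the MCP specification as I understand it. The classes `ToyServer` and `ToyClient`, and the in-process "transport" between them, are hypothetical and exist only for illustration.

```python
import json
from typing import Callable, Dict, List, Optional


class ToyServer:
    """Hypothetical in-memory server that speaks a tiny subset of MCP."""

    def __init__(self) -> None:
        self._tools: Dict[str, dict] = {}
        self._notify: Optional[Callable[[str], None]] = None

    def connect(self, on_notification: Callable[[str], None]) -> None:
        # Stand-in for a real transport such as stdio or HTTP.
        self._notify = on_notification

    def handle(self, raw_request: str) -> str:
        msg = json.loads(raw_request)
        if msg["method"] == "tools/list":
            result = {"tools": list(self._tools.values())}
            return json.dumps({"jsonrpc": "2.0", "id": msg["id"], "result": result})
        # Standard JSON-RPC error for an unknown method.
        error = {"code": -32601, "message": "Method not found"}
        return json.dumps({"jsonrpc": "2.0", "id": msg["id"], "error": error})

    def add_tool(self, name: str, description: str) -> None:
        self._tools[name] = {
            "name": name,
            "description": description,
            "inputSchema": {"type": "object"},
        }
        if self._notify is not None:
            # Notifications carry no "id" because no response is expected.
            self._notify(json.dumps({
                "jsonrpc": "2.0",
                "method": "notifications/tools/list_changed",
            }))


class ToyClient:
    """Hypothetical client that re-lists tools whenever the server says so."""

    def __init__(self, server: ToyServer) -> None:
        self.server = server
        self.tool_names: List[str] = []
        self._next_id = 0
        server.connect(self.on_notification)

    def refresh_tools(self) -> None:
        self._next_id += 1
        request = {"jsonrpc": "2.0", "id": self._next_id, "method": "tools/list"}
        response = json.loads(self.server.handle(json.dumps(request)))
        self.tool_names = [tool["name"] for tool in response["result"]["tools"]]

    def on_notification(self, raw: str) -> None:
        if json.loads(raw).get("method") == "notifications/tools/list_changed":
            self.refresh_tools()
```

The notification itself does not carry the new tool list; it only tells the client that its cached list is stale, and the client then asks again with `tools/list`.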

Google's Titans model adds a dedicated "long-term memory" that keeps adapting and updating at test time. Unlike traditional transformers, whose attention cost grows quadratically with sequence length, it scales with linear time complexity on long input sequences. The model has a 2 million token context window and preferentially remembers surprising events, mimicking human-like memory.
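
The "remembers surprising events" idea can be illustrated with a small Python sketch. It is loosely based on the update rule described for Titans, where the memory takes a gradient step on a recall loss and the size of that gradient acts as the surprise signal. Several things here are simplifying assumptions: the memory is a single linear map rather than a deep network, the learning rate, momentum, and forgetting coefficients are fixed constants rather than learned from the data, and the class name `SurpriseMemory` is hypothetical.

```python
from typing import List

Vector = List[float]


class SurpriseMemory:
    """Toy associative memory updated at test time (linear map, assumed)."""

    def __init__(self, dim: int, lr: float = 0.1,
                 momentum: float = 0.5, forget: float = 0.01) -> None:
        self.M = [[0.0] * dim for _ in range(dim)]  # memory weights
        self.S = [[0.0] * dim for _ in range(dim)]  # accumulated surprise
        self.lr = lr
        self.momentum = momentum
        self.forget = forget

    def read(self, key: Vector) -> Vector:
        """Recall the value associated with key (a matrix-vector product)."""
        return [sum(m * k for m, k in zip(row, key)) for row in self.M]

    def write(self, key: Vector, value: Vector) -> float:
        """Store (key, value) and return how surprising the pair was."""
        error = [p - v for p, v in zip(self.read(key), value)]
        surprise = sum(e * e for e in error)  # recall loss ||M k - v||^2
        for i, e in enumerate(error):
            for j, k in enumerate(key):
                grad = 2.0 * e * k  # gradient of the loss w.r.t. M[i][j]
                # Past surprise decays; the new gradient is added to it.
                self.S[i][j] = self.momentum * self.S[i][j] - self.lr * grad
                # Old memory is slightly forgotten, then nudged by surprise.
                self.M[i][j] = (1.0 - self.forget) * self.M[i][j] + self.S[i][j]
        return surprise
```

Each `write` does a fixed amount of work regardless of how many tokens came before, so processing a sequence costs time linear in its length. An input the memory already predicts well produces a small gradient and barely changes it, while an unexpected input produces a large update.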