OpenAI Launches Training Spec to Boost Large-Scale AI
The protocol is designed to improve GPU performance as AI compute ramps up.
TechCrunch AI
Astronomers are turning to GPUs to find needles in the galactic haystack.
The ongoing shift from generative AI (genAI) to agentic AI provides an opportunity for enterprises to move to more nimble and less expensive forms of computing, according to analysts. Early AI models were largely built on expensive GPUs from Nvidia and AMD that offered raw processing power. But newer agentic AI tools, rooted in business process and workflow management, can run on more efficient, cost-effective hardware.

As a result, IT decision-makers who still think they require GPUs for anything AI-related need to reconsider their hardware options in terms of both cost and capabilities, analysts said. “A better way of thinking about this is the cost of AI compute and now agentic AI platform services or systems,” said Leonard Lee, principal analyst at Next Curve. “‘AI computing’ or ‘accelerated computing’ has clearly transcended the GPU as an inference accelerator.”

The new hardware options include CPUs and specialized AI chips, also known as ASICs in semiconductor parlance. Although […]
Learn how to transcribe audio locally using Faster‑Whisper and Python, with an emphasis on privacy‑first processing and support for both CPU and GPU.
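A minimal sketch of what that local workflow can look like, assuming an audio file named `meeting.wav` in the working directory (the file name and model size are illustrative, not from the tutorial):

```python
# Local transcription with faster-whisper; nothing leaves the machine.
from faster_whisper import WhisperModel

# "cpu" + int8 keeps this runnable without a GPU; swap to
# device="cuda", compute_type="float16" if one is available.
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("meeting.wav")  # assumed input file
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```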
In this tutorial, we explore kvcached, a dynamic KV-cache implementation on top of vLLM, to understand how dynamic KV-cache allocation transforms GPU memory usage for large language models. We begin by setting up the environment and deploying lightweight Qwen2.5 models through an OpenAI-compatible API, ensuring a realistic inference workflow. We then design controlled experiments where […]
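For the client side of that workflow, a minimal sketch might look like the following, assuming a kvcached-enabled vLLM server is already running locally and exposing an OpenAI-compatible endpoint; the port and model name below are assumptions, not the tutorial's exact values:

```python
# Query a local vLLM server through its OpenAI-compatible API.
from openai import OpenAI

# base_url and api_key are placeholders for a local deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # assumed lightweight model
    messages=[{"role": "user", "content": "Summarize what a KV cache stores."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, the same client code works whether the KV cache underneath is statically or dynamically allocated, which is what makes it a clean harness for the memory experiments.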
This Spring Astronomy Day, here’s a look at how AI and GPUs are helping astronomers work through unprecedented volumes of cosmic data.
Late last year I got pulled into a capacity planning exercise for a global retailer that had wired a 70B model into their product search and recommendation pipeline. Every search query triggered an inference call. During holiday traffic their cluster was burning through GPU-hours at a rate that made their cloud finance team physically uncomfortable. They had already scaled from 24 to 48 H100s and latency was still spiking during peak hours. I was brought in to answer a simple question: Do we need 96 GPUs for the January sale, or is something else going on?

I started where I always start with these engagements: profiling. I instrumented the serving layer and broke the utilization data down by inference phase. What came back changed how I think about GPU infrastructure.

During prompt processing — the phase where the model reads the entire user input in parallel — the H100s were running at 92% compute utilization. Tensor cores fully saturated. Exactly what you want to see on a $30K GPU. But […]
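The excerpt doesn't show the instrumentation itself, but a hedged sketch of phase-level profiling could sample NVML utilization in a background thread while the serving layer tags the current phase; `model.prefill` and `model.generate_tokens` below are placeholders for whatever your serving code actually calls:

```python
# Sample GPU utilization and bucket it by inference phase.
import threading
import time
from collections import defaultdict

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

current_phase = "idle"           # flipped by the serving layer
samples = defaultdict(list)      # phase -> list of utilization %

def sampler(stop: threading.Event, interval_s: float = 0.05) -> None:
    while not stop.is_set():
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        samples[current_phase].append(util)
        time.sleep(interval_s)

stop = threading.Event()
threading.Thread(target=sampler, args=(stop,), daemon=True).start()

# In the serving loop you would tag each phase, e.g.:
#   current_phase = "prefill";  model.prefill(prompt)          # placeholder
#   current_phase = "decode";   model.generate_tokens(prompt)  # placeholder
time.sleep(0.2)  # let the sampler collect a few points in this demo

stop.set()
for phase, vals in samples.items():
    print(f"{phase}: {sum(vals) / len(vals):.1f}% avg utilization")
```

Breaking utilization down this way is what separates "we need more GPUs" from "our GPUs are idle during one of the two phases".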
In this tutorial, we build an end-to-end implementation around Qwen 3.6-35B-A3B and explore how a modern multimodal MoE model can be used in practical workflows. We begin by setting up the environment, loading the model adaptively based on available GPU memory, and creating a reusable chat framework that supports both standard responses and explicit thinking […]
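A minimal sketch of that adaptive loading step, using the standard Transformers pattern; the checkpoint id below mirrors the post's naming and is a hypothetical placeholder, and the 24 GiB threshold is an illustrative assumption:

```python
# Pick precision and placement based on available GPU memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3.6-35B-A3B"  # hypothetical id for illustration

if torch.cuda.is_available():
    total_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
else:
    total_gib = 0

# bf16 needs roughly 2 bytes per parameter, so fall back on small GPUs.
dtype = torch.bfloat16 if total_gib >= 24 else torch.float32
device_map = "auto" if total_gib > 0 else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=dtype, device_map=device_map
)
```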
In this tutorial, we explore how to run OpenAI’s open-weight GPT-OSS models in Google Colab with a strong focus on their technical behavior, deployment requirements, and practical inference workflows. We begin by setting up the exact dependencies needed for Transformers-based execution, verifying GPU availability, and loading openai/gpt-oss-20b with the correct configuration using native MXFP4 quantization, […]
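One common pattern for that loading step, sketched here under the assumption of a recent Transformers version that handles the checkpoint's native MXFP4 weights with `torch_dtype="auto"`; the prompt and generation settings are illustrative:

```python
# Load openai/gpt-oss-20b and run a short chat-style generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Explain MXFP4 in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```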