Disclosure: The HP OMEN MAX 16 was provided on loan by HP for review purposes and will be returned. No payment was received. Editorial independence was maintained throughout.

 

Introduction: The Nearly $7,000 Question

Every prompt you send to ChatGPT leaves your machine.

For everyday use, that’s usually fine. Asking about cooking techniques, drafting a birthday message, writing a resignation letter you’ve been putting off for three weeks: none of that is particularly sensitive. But when the prompt contains confidential client data, unpublished research, or proprietary code, the calculation changes entirely.

Cloud AI costs around $30 AUD per month. What it costs in loss of control is harder to quantify. Your data passes through someone else’s servers, subject to their retention policies and legal jurisdictions, often US-based, which adds complexity for Australian businesses navigating local privacy law.

Local AI removes that variable. The trade-off is hardware.

When HP offered laptops for review and asked what I’d test, my answer was straightforward: can a laptop genuinely handle local LLM workloads for real work? HP didn’t hesitate. Great start.

So here we are. The HP OMEN MAX 16, Intel Core Ultra 9 275HX, RTX 5080 with 16GB GDDR7 VRAM, 32GB system RAM, tested with Ollama and llama.cpp, models from 8B to 70B parameters, pushed to thermal limits, memory ceilings, and workflow reality. This is the first in what will be a series of benchmarks comparing local AI hardware, because one data point is just a starting position.

This isn’t a standard laptop review. I’m not here to talk about bezels or RGB lighting. Though for the record, the keyboard does light up with a rainbow that would make a nightclub jealous, and the screen is nearly glossy enough to double as a mirror when switched off. But that’s not why we’re here.

The question is simple: is local AI on consumer hardware viable in 2026, or is it still just expensive cosplay? The results are clearer than the industry narrative suggests.

HP OMEN MAX 16 Rainbow effect keyboard

Why Local? Beyond the Hype

Before diving into performance, it’s worth establishing why anyone would go down the local AI path. Let’s be honest: most users don’t need this level of control.

If you’re happy with ChatGPT and your biggest concern is whether it can write a decent cover letter, you probably don’t need to peek under the hood.

But if you’ve ever paused before hitting send on a prompt containing client data, or felt that twinge of concern about where your queries actually go, or wondered what happens to your conversation history after you close the browser, this is for you.

 

Data Sovereignty: When Cloud Isn’t an Option

For some users, local AI isn’t a preference. It’s a requirement.

A lawyer preparing a defence strategy can’t send case details to OpenAI’s servers without risking attorney-client privilege. Under Australian legal professional privilege rules, using cloud AI for sensitive case work creates potential disclosure risks. Using a local 13B model, they can draft motions, extract relevant passages from case documents, reorganise arguments, and summarise depositions, all without data leaving their laptop. The alternatives are paying for enterprise-grade isolated instances, which is expensive and complex, or forgoing AI altogether.

A medical researcher analysing patient interviews, under the Privacy Act and potentially state-based health privacy legislation, may face restrictions on uploading transcripts to external cloud providers even when anonymised. A local 30B model can assist with thematic analysis, symptom extraction, and highlighting unusual patterns, all within the researcher’s controlled environment.

A startup developing proprietary software can’t afford to send prototype code or training approaches to external APIs. Each prompt introduces potential IP exposure. Running a 7B coding model locally allows them to iterate freely without expanding the exposure surface to third-party providers. That matters in Australia’s tightly connected tech ecosystem, where competitors often share investors, advisors, and talent pools.

A communications consultancy holding advance marketing details for products not yet on the market risks breaking embargoes and leaking confidential launch information every time those details pass through a third-party cloud service.

 

Privacy: The Personal Angle

Beyond regulatory requirements, there’s the personal calculation. Do you want every half-formed idea, every draft email, every experimental prompt logged by a third party? Depending on provider terms and settings, prompts may be retained in ways you don’t directly control. Perhaps you trust OpenAI or Anthropic implicitly. Perhaps you don’t. Local AI shifts the execution boundary onto your own hardware. Model inference runs locally, on your GPU and CPU, within your operating system and network environment. There is no external API endpoint involved in processing your prompts, no provider-side retention policy attached to your inputs, and no automatic transmission layer required for inference. If data leaves the device, it does so due to your system configuration or network security posture, not because the model architecture depends on a third-party service.

Control: The Technical Angle

Cloud AI centralises decision-making. Model versions, response formatting, content filtering and rate limits are ultimately governed by the provider’s infrastructure and policies. You’re also at the mercy of model updates you didn’t ask for, whether your workflow is ready for them or not.

Local AI relocates those decisions to your own environment. You configure temperature, top-p and repetition penalties. Quantisation becomes a trade-off you control between speed, memory footprint, and output fidelity. System prompts can be modified, models fine-tuned, and multi-model workflows orchestrated without provider-imposed ceilings.

That control introduces complexity. For technical users, that complexity is often exactly the point.

 

The Hardware: What You’re Getting

The star of the show, for this review, is the 16GB of GDDR7 VRAM on the RTX 5080. For local LLM work, VRAM is your performance ceiling. Everything that fits entirely within VRAM runs fast. Everything that doesn’t fit offloads to system RAM, and that’s where performance degrades.
Think of VRAM as the VIP lounge. Models that get in have a great time. Models stuck outside in the queue: not so much.

A quick primer on quantisation, because it matters here. LLMs store their parameters as numbers. Full-precision models (FP16) are large and slow; quantisation compresses those numbers. Q4 uses 4 bits per parameter instead of 16, cutting model size by roughly 75% with minimal quality loss. Q4_K_M is a medium-quality 4-bit quantisation that balances size, speed, and output quality. A smaller number in the name generally means faster performance with some trade-off in output quality.
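To make that arithmetic concrete, here is a rough back-of-envelope sketch in Python. It estimates the footprint of the model weights alone (no KV cache or runtime overhead) from parameter count and an average bits-per-parameter figure; the numbers are illustrative, and real GGUF files run slightly larger because schemes like Q4_K_M keep some tensors at higher precision.

# Rough weight-size estimate: parameters x average bits per parameter.
# Illustrative only; real quantised files vary by scheme and architecture.

def weight_size_gb(params_billion: float, bits_per_param: float) -> float:
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1024**3

for name, params in [("8B", 8), ("14B", 14), ("32B", 32), ("70B", 70)]:
    fp16 = weight_size_gb(params, 16)
    q4 = weight_size_gb(params, 4.8)  # Q4_K_M averages a little under 5 bits
    print(f"{name}: ~{fp16:.0f} GB at FP16, ~{q4:.1f} GB at Q4_K_M-class quantisation")

Set against 16GB of VRAM, those figures explain the tiers that follow: 8B and 14B fit with room to spare, 32B needs modest offloading, and 70B spills heavily into system RAM.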

The 32GB of system RAM provides breathing room for offloading, but VRAM remains the golden gate. Cross it, and you’re in a different world of responsiveness.

One detail worth highlighting: the OMEN’s RAM is user-accessible via two SO-DIMM slots. If you’re comfortable removing a few screws without losing them, upgrading to 64GB later is straightforward. That matters for local LLM work, because more system RAM lets larger models offload more gracefully once they exceed VRAM capacity. Mac users don’t have this option. On Apple Silicon, unified memory is soldered at purchase; capacity is fixed from day one.

Specification | Detail
CPU | Intel Core Ultra 9 275HX
GPU | NVIDIA GeForce RTX 5080 (16GB GDDR7 VRAM)
RAM | 32GB DDR5-5600 (upgradeable to 64GB)
Storage | 1TB SSD
Display | 16″ WQXGA OLED (2560×1600)
Price | $6,999 AUD (RRP)

The hardware sets the ceiling. The tooling determines how close you get to it.

Task Manager GPU RTX5080 9GB post inference

 

Testing Framework: Ollama vs llama.cpp

For this review, I tested with both Ollama and llama.cpp running directly from the terminal to understand the practical differences. The short version is that for most users running models locally, Ollama wins on convenience with negligible performance trade-off on this hardware. Formal benchmarks and all reported tokens-per-second figures were gathered via Ollama in the terminal, as this provided the most consistent and reproducible results.

llama.cpp exposes the underlying inference engine directly. It provides detailed visibility into token generation speeds, memory usage, layer offloading behaviour and runtime parameters. If you’re benchmarking hardware, tuning performance or need diagnostic-level insight into how a model is executing, it remains the reference tool.

Ollama sits a layer above that engine and prioritises accessibility. Install it, pull a model, run it. No compilation, no configuration files, no driver archaeology. Its CLI still exposes controls such as temperature, context length and system prompts, but the tooling focuses on streamlined deployment rather than exposing every runtime detail.

Ollama also provides a lightweight desktop interface alongside the command-line tools. It offers a simple chat-style environment for interacting with local models without using the terminal. Under the hood it runs the same inference engine as the CLI, so performance remains comparable, but the graphical interface lowers the barrier for users who simply want to download a model and start experimenting.

On this OMEN, using identical quantisations and comparable settings, performance differences between Ollama and llama.cpp were small. In most cases they sat within single-digit percentages, occasionally approaching 10% in edge scenarios. Nothing that materially changes the user experience.

Think of it this way. llama.cpp is a manual transmission. Ollama is an automatic with paddle shifters. Most people just want to get where they’re going without thinking about gear ratios.

That said, llama.cpp remains invaluable for power users who want complete visibility into runtime behaviour or need to extract every last percentage point of performance from constrained hardware. For this review, I used both but structured formal testing around Ollama because that’s what most users will realistically reach for.

 

Performance: Where Theory Meets Reality

The VRAM Golden Gate

That 16GB envelope defines what the benchmarks below will show. Where models fit, performance is strong. Where they don’t, they stop at the gate.

Real-World Model Testing

All models were tested using Q4_K_M quantisation, with identical prompts across runs.

Llama 3.1 8B: Instantaneous. No perceptible delay between prompt submission and generation starting. At 121 t/s, this is desktop-class inference performance in a portable form factor. This is the sweet spot for interactive work, fast enough that it doesn’t interrupt your flow. If you’ve ever used ChatGPT and thought “this is pleasantly quick,” that’s the experience here, except it’s all happening on your laptop, offline, with zero external dependencies.

Qwen 2.5 14B: Noticeably more capable than the 8B on complex reasoning tasks, with only a modest hit to generation speed. Still fits comfortably within VRAM with room to spare. For users who find the 8B slightly too shallow on nuanced queries but don’t want to commit to larger model load times, this is a genuinely useful middle ground.

Qwen 2.5 32B: Usable. Responses took a few seconds to start, but once generating, output was smooth. This is a dense 32 billion parameter model, meaning every one of those parameters is active on every token. It requires some offloading to system RAM, but not enough to cause stuttering. For complex tasks where capability matters more than instant gratification, this was the sweet spot. The model takes a breath before speaking. Noticeable, but not frustrating.

Qwen3.5-35B-A3B: The name needs unpacking. It uses a Mixture-of-Experts architecture: 35B total parameters, but only 3 billion active at any given moment. The A3B suffix is the tell. It also ships as a reasoning model, surfacing its intermediate reasoning steps by default. For the warm-up prompt (“This is a test. Please respond with understood.”), the reasoning trace ran longer than the final answer, a consequence of exposing the model’s internal reasoning tokens rather than hiding them behind the final response. On this hardware, it fit in memory and completed the warm-up at 22.38 t/s. It then crashed on every substantive benchmark prompt that followed. The KV cache section below explains why. The theory is compelling. The practice, right now, is not there yet.

Llama 3.1 70B: Slow, but functional in a way the 35B currently isn’t. Under performance mode with the laptop plugged in, generation sits at 1.28 to 1.84 tokens per second across all prompts. Usable for batch work where you queue queries and walk away. Interactive conversation? Absolutely not. You send a prompt, make a cup of tea, feel mildly guilty about running the hardware this hard, and come back to a coherent response waiting for you. It’s a workflow, just not a quick one.

Task Manager RAM 100pct maxed

 

Benchmark Results: Tokens Per Second

Three prompts were run across all models in Ollama via the terminal (ollama run modelname) with Windows performance mode enabled and the laptop plugged in. Results are measured in tokens per second (t/s) during generation. Higher is better.

All benchmarks were executed using default Ollama runtime parameters. No custom Modelfiles were used. The only exception was during troubleshooting of the Qwen3.5-35B-A3B model, where parameter adjustments were briefly attempted in-session to diagnose instability.

Warm-up prompt:

This is a test. Please respond with ‘understood’.

Ran before each model to establish baseline behaviour. Because the output is short, generation figures reflect minimal sustained load rather than peak sustained throughput.

Prompt 1: Legal risks (capability test).

Summarise the key legal risks an Australian small business should consider before using cloud-based AI tools for client work.

Prompt 2: Instruction following (constraint test).

List exactly five practical use cases for local LLMs in a professional environment. Each must be one sentence. No introductory text.

Prompt 3: Drift test (limits test).

What is the most difficult question I could ask you, and why?

No defined endpoint. Designed to reveal model behaviour without guardrails.
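For anyone wanting to reproduce these runs, ollama run <model> --verbose prints per-response timing statistics, including prompt eval and eval rates, in the terminal. As a rough alternative, here is a minimal Python sketch against Ollama’s local HTTP API (default port 11434). The model tag is illustrative; it assumes Ollama is already running and the model has been pulled.

# Minimal sketch: send the benchmark prompts to a local Ollama instance and
# derive prompt eval rate and eval rate (tokens/s) from the nanosecond
# durations returned by /api/generate. Assumes the model is already pulled.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:8b"  # illustrative model tag

PROMPTS = [
    "This is a test. Please respond with 'understood'.",
    "Summarise the key legal risks an Australian small business should "
    "consider before using cloud-based AI tools for client work.",
    "List exactly five practical use cases for local LLMs in a professional "
    "environment. Each must be one sentence. No introductory text.",
    "What is the most difficult question I could ask you, and why?",
]

for prompt in PROMPTS:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=600,
    ).json()
    # Durations are reported in nanoseconds.
    prompt_rate = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    eval_rate = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"prompt eval: {prompt_rate:7.1f} t/s | eval: {eval_rate:6.1f} t/s")

Note that each API call above starts a fresh context. The cached-context spikes discussed below appear only when prompts are run sequentially within the same session, as they were for the published figures.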

Prompt Eval Rate (tokens/s)

How fast the model processes your input. Higher is better.

Model | Warm-up | Prompt 1 | Prompt 2 | Prompt 3 | Notes
Llama 3.1 8B | 133.94 | 359.30 | 2310.97 | 2781.71 | Spike reflects cached context from prior prompts.
Qwen 2.5 14B | 187.16 | 294.91 | 2548.73 | 192.42 | Prompt 3 drop may reflect context handling variance.
Qwen 2.5 32B | 90.77 | 156.65 | 1202.44 | 821.16 | Consistent growth pattern. Dense model behaves predictably.
Qwen3.5-35B-A3B | 61.54 | n/a | n/a | n/a | Reasoning model; shows thinking process by default. Completed warm-up, then crashed on every benchmark prompt.
Llama 3.1 70B | 1.60 | 1.78 | 47.67 | 63.62 | Low baseline but context caching still visible.

LLM Prompt Processing Rates

 

Eval Rate (tokens/s)

How fast the model generates its response. This is the practically meaningful number for day-to-day use.

Model | Warm-up | Prompt 1 | Prompt 2 | Prompt 3 | Notes
Llama 3.1 8B | 121.67 | 122.17 | 120.89 | 117.42 | Fully in VRAM. Remarkably consistent across all prompts.
Qwen 2.5 14B | 96.91 | 66.95 | 66.16 | 65.20 | Drop after warm-up likely reflects context window growth.
Qwen 2.5 32B | 10.84 | 8.23 | 8.10 | 8.23 | Dense model. Stable, predictable, readable but not conversational.
Qwen3.5-35B-A3B | 22.38 | n/a | n/a | n/a | Reasoning model; shows thinking process by default. Completed warm-up, then crashed on every benchmark prompt.
Llama 3.1 70B | 1.28 | 1.84 | 1.37 | 1.68 | Majority in system RAM. Batch use only.

LLM Eval Rate by Prompt

 

The prompt eval rate spikes dramatically on Prompts 2 and 3 for most models. That’s because by that point the context window already contains the previous prompts and responses. Ollama is reprocessing those cached tokens, and reading existing tokens is far faster than generating new ones.

Each model was warmed up, then prompts were run sequentially in the same session. The spikes are therefore consistent and expected, not anomalous.

The reason prompt eval rates are generally much higher than eval rates is worth understanding. Processing input and generating output are fundamentally different operations. During prompt evaluation, the model reads tokens that already exist. The GPU can process many of them in parallel because the full input sequence is known upfront. During generation, each new token depends on everything before it. That dependency forces sequential execution. You cannot parallelise the chain.

Prompt evaluation is a batch problem. Generation is a sequential one.

The eval rate, generation speed, is the practically meaningful number for day-to-day use. When you are waiting for a response, that is the figure you feel.

A note on performance mode: all benchmarks were run with the laptop plugged in and set to performance mode. Earlier informal testing in balanced mode produced dramatically different results, particularly on larger models where CPU-assisted RAM offloading is doing significant work.

One example: an early run of the open-ended “lizard” prompt, covered later in this review, took roughly 14 minutes to complete in balanced mode. Under performance mode, the same class of prompt behaves very differently. The difference between balanced and performance mode on sustained inference workloads is substantial.

The OMEN should be run plugged in, in performance mode, or your results will not reflect what this machine is actually capable of.

 

The KV Cache Tax: Why Context Length Has a Hidden Cost

When people discuss whether a model will fit on a given GPU, they usually focus on parameter count and quantisation. Those factors determine how much memory the model weights consume, but they are only part of the story.

During inference, large language models allocate an additional structure called the key-value cache, or KV cache. This cache stores intermediate token representations as they are processed, allowing the model to reference prior context efficiently during generation. Think of it as the model’s working memory for the current conversation, distinct from the long-term knowledge embedded in the weights themselves.

The KV cache scales with three things: the model’s hidden dimension and number of layers, the context window size, and the number of tokens generated. This means that even if the model weights fit comfortably within available VRAM, generation can still fail if the KV cache grows too large.

A model may initialise successfully and respond to a short test prompt, then crash when asked to produce a longer, more realistic response. The failure is not due to parameter count alone. It is memory pressure from an expanding KV cache combined with GPU offload behaviour.

This is exactly what occurred with the Qwen3.5-35B-A3B during testing. It initialised without issue and completed the warm-up prompt, then threw a 500 internal server error mid-response on every longer benchmark prompt that followed. No recovery.

The MoE architecture loads the full expert set into memory at initialisation, even though only a sparse subset is active per token. That creates a larger and less predictable memory footprint than a dense model with comparable active parameters, at least until runtimes optimise expert-specific loading and offloading. Ollama and llama.cpp support for these newer MoE formats is still maturing (early 2026).

The 70B dense model, by contrast, has entirely predictable memory allocation behaviour. Ollama can offload it cleanly and consistently. It is slower, but it is stable.

A 32B dense model at Q4 quantisation uses roughly 18GB at baseline. Load a 32K context window, roughly 40 pages of text, and the KV cache alone can consume an additional 3–4GB for conversation history. That comfortable fit becomes a squeeze very quickly.
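As a rough illustration of that arithmetic, here is a hedged Python sketch. The layer count, KV head count, and head dimension are hypothetical but typical of a 32B-class model with grouped-query attention; the real footprint depends on the specific architecture and on the KV cache precision the runtime uses.

# Back-of-envelope KV cache estimate:
#   bytes = 2 (keys + values) x layers x kv_heads x head_dim
#           x context_tokens x bytes_per_element
# Architecture numbers below are illustrative, not measured.

def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, bytes_per_elem):
    total = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem
    return total / 1024**3

layers, kv_heads, head_dim = 64, 8, 128  # hypothetical 32B-class GQA model
for ctx in (4096, 16384, 32768):
    fp16 = kv_cache_gb(layers, kv_heads, head_dim, ctx, 2)  # FP16 cache
    q8 = kv_cache_gb(layers, kv_heads, head_dim, ctx, 1)    # 8-bit cache
    print(f"{ctx:>6} tokens: ~{fp16:.1f} GB (FP16 cache), ~{q8:.1f} GB (8-bit cache)")

Depending on cache precision and architecture, a 32K window lands in the low single-digit gigabytes, which is where the 3–4GB squeeze described above comes from.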

During testing, 32B models hit their stride at 16–32K context windows. Push beyond that and you will either need more aggressive quantisation or accept RAM offloading and the performance hit that follows.

The takeaway is simple but important: context length and generation constraints can be as decisive as parameter size when determining whether a model is practically usable on consumer hardware.

It is like trying to remember every detail of a three-hour conversation without having taken notes. Eventually your brain starts skipping bits. The KV cache hits its ceiling and the model either slows dramatically or stops cooperating entirely.

LLM Profound Questions List and Response

 

Real-World Multitasking: Can You Still Do Your Job?

Benchmarks run in isolation are useful. They are also slightly fictional. In practice, nobody runs a local LLM on a machine with nothing else open.

The real question is simple: can you still write an email or work on a document while a prompt is running?

The honest answer required testing it properly.

I ran Qwen 2.5 32B via Ollama with Microsoft Edge open and the Ollama desktop app running in the background, the latter an accident I then deliberately left in place as it better reflects real-world conditions. Task Manager showed 19.1GB of 31.4GB RAM in use, committed memory at 35.4GB against a 41.9GB ceiling, GPU at 41°C, CPU at 29%.

Generation speed: 8.02 tokens per second.

Under clean benchmark conditions, the same model produced between 8.10 and 8.23 t/s. The difference is under 3%. For practical purposes, the machine was unbothered.

I ran the same prompt twice under slightly different memory conditions. First run: 61 input tokens, lighter RAM footprint, 7.89 t/s. Second run: 473 input tokens, Edge active, Ollama desktop running, 8.02 t/s. Generation speed did not drop under heavier load.

This machine does not punish you for using it like a normal person uses a laptop. That matters more than synthetic peak numbers.

Can you write an email while a 32B model is running? Yes. Document open? Yes. Meaningful degradation in model performance? No.

One caveat: this holds in the 8B to 32B range, where most inference work remains GPU-bound with system RAM acting as overflow. For the 70B model, where the CPU is heavily involved in managing layer offloading, background applications competing for CPU cycles will have a more noticeable effect.

If you are running 70B, close what you do not need. For everything else, work normally.

 

Thermal Performance: The Unthrottled Advantage

Under sustained LLM load, the OMEN MAX 16 remained thermally composed. Testing was conducted in a room at 26°C ambient.

External surface temperatures, measured with a thermal camera, peaked at 46°C. Internal GPU temperature, monitored via Windows Task Manager, reached a maximum of 50°C. The heaviest workload in this test series was Llama 3.1 70B running continuously with active VRAM-to-RAM layer offloading.

26°C ambient. 46°C external surface. 50°C internal GPU.
Under sustained, non-gaming compute load.

For context, many laptops will push well beyond 70°C internally during extended high-intensity workloads, particularly when CPU and GPU are both engaged. Thermal throttling under sustained compute is common in thinner designs tuned primarily for burst performance.

That did not occur here, even allowing for the fact that Task Manager reports GPU die temperature rather than peak hotspot temperature.

Clock stability remained consistent throughout extended inference sessions. There was no observable thermal slowdown in token generation across runs with performance mode enabled and the system plugged in.

This matters more for AI workloads than for gaming. Gaming loads spike and dip. Large-model inference is relentless. It does not pause between frames.

The OMEN MAX 16 sustains that load without thermal collapse.

For users intending to run 8B to 32B models interactively, or even 70B models in batch workflows, thermal behaviour is not the limiting factor on the OMEN. The more interesting question is what the Blackwell architecture can do with that headroom.

HP OMEN MAX 16 Thermal Load running LLM

 

The 2026 Advantage: NVFP4 Quantisation

The RTX 50-series Blackwell architecture introduces support for NVIDIA’s NVFP4 format, a 4-bit floating-point representation designed specifically for accelerated inference workloads.

NVFP4 allows models to occupy less memory while preserving dynamic range characteristics that traditional integer quantisation compresses more aggressively. When supported by the runtime and model, this can translate into improved memory efficiency and potentially higher effective throughput.

Tooling support, however, is still maturing. Ollama and llama.cpp do not currently expose NVFP4 paths natively, and the benchmarks in this review therefore reflect standard quantisation methods rather than NVFP4-optimised inference.

The performance documented here represents what this hardware delivers today through current software stacks. NVFP4 represents architectural headroom. Models in the 30–34B range that are already practical on this machine may become more memory-efficient or faster as runtime support improves. The practical ceiling shifts upward without the hardware changing.

Not every model supports NVFP4. Not every runtime exposes it. But the architectural capability exists at the silicon level. This is not a speculative marketing bullet point. It is a forward-looking design decision that positions the RTX 5080 more favourably over the next several tooling cycles.

Architectural capability is only half the story, though. The other half is how you choose to use the models that run on it.

 

Strategy: When Smaller is Smarter

Bigger is not always better.

A 70B model offers stronger reasoning depth and broader language fluency. It is also slower, more memory-intensive, and less responsive in tight feedback loops.

For many real workflows, iteration speed matters more than theoretical capability.

Smaller models in the 7B to 14B range are faster, easier to constrain, and more predictable. They enable rapid refinement. They return results instantly. They stay within guardrails more reliably under strict instruction.

In practice, it is often more efficient to orchestrate specialised models in the 2B–7B range, each handling a defined task such as summarisation, extraction, or classification, than to rely on a single monolithic large model attempting to do everything. It is the difference between a focused team and an overburdened polymath.

During testing, the 8B and 14B models saw the most practical use. Not because they surpassed the 70B in raw capability, but because they matched the task. Quick iterations. Immediate responses. Predictable behaviour.

The 70B model has a clear role for deep reasoning or complex synthesis where latency is acceptable. It is powerful. It is not interactive.

The Qwen3.5-35B-A3B hints at where model architecture is heading: large total parameter counts with selective activation per token. In theory, that combines scale with efficiency. On this hardware today, it is not stable enough for production use. That will evolve as runtimes mature.

For now, the dense 8B to 32B range is where this machine delivers sustained, practical value.

 

Working With Multiple Models

Cloud AI encourages the idea that one model should handle everything. In practice, local workflows tend to look different.

Smaller models in the 7B–14B range are fast and well-suited to iteration. Outlining, summarising, and testing different ways to frame a prompt. Larger models in the 30B–70B range reason more carefully but generate output far more slowly. They are better deployed to review, challenge, or refine something that already exists.

Smaller models also become particularly useful in retrieval-based workflows. When a model is grounded in a well-structured document set (a process known as RAG, or Retrieval-Augmented Generation), raw parameter count often matters less than responsiveness and consistency. A smaller model paired with strong retrieval can outperform a much larger model operating without domain context.

In practice this becomes a workflow rather than a single prompt. A smaller model produces a structure or a draft. A larger model then interrogates or refines it. The goal is not to find a single model that does everything. It is to sequence them in a way that plays to their individual behaviour.
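As a concrete illustration of that draft-then-review sequencing, here is a minimal Python sketch using Ollama’s local API. The model tags and prompts are placeholders; the point is the shape of the workflow, small model first, larger model second, loaded one at a time.

# Two-stage local workflow sketch: a fast small model drafts, a slower
# larger model reviews. Models run sequentially so each stays within the
# 16GB VRAM envelope. Model tags and prompts are illustrative.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask(model: str, prompt: str) -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=1200,
    )
    resp.raise_for_status()
    return resp.json()["response"]

brief = "Outline a one-page privacy impact summary for a local-AI rollout."

# Stage 1: the fast 8B model produces the structure.
draft = ask("llama3.1:8b", f"Draft a structured outline. Task: {brief}")

# Stage 2: the slower 32B model interrogates and refines the draft.
review = ask(
    "qwen2.5:32b",
    "Review the following outline for gaps, weak assumptions and missing "
    f"legal considerations, then return an improved version:\n\n{draft}",
)

print(review)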

Hardware constraints reinforce this pattern. On a system with 16GB of VRAM, running multiple large models simultaneously is not practical. Switching between models for different stages of a task works well, and keeps the system within its memory limits.

The shift is architectural. Instead of asking one model to solve the entire problem, the process becomes about choosing the right tool for each stage.

 

Understanding Local LLM Behaviour: The Lizard Problem

Local LLM parameters explanation

The strategic difference between cloud and local AI becomes clear when you observe how local models behave without strong structural constraints.

In an early local test run, I used this prompt:

What is the most difficult question I could ask you, and why?

The model generated a detailed narrative about discovering a new species of lizard. Taxonomy. Habitat descriptions. Scientific controversy. Peer review drama. It was coherent, well structured, and completely irrelevant.

I read the entire thing. It was genuinely engaging. I almost wanted to know more about this fictional lizard.
It had nothing to do with the question.

This is drift.

When a prompt has no defined endpoint, the model defaults to a familiar narrative structure and commits to it. Coherence remains high. Relevance dissolves.

That particular run was via llama.cpp at default settings, which use a higher temperature than Ollama’s defaults. There was also no explicit token limit. The model had no reason to stop and no constraint guiding its direction.

The same prompt run in Ollama, using the same model but under more controlled parameters, produced a coherent, on-topic response.

The behaviour you observe depends as much on configuration as on model selection.

You rarely see this kind of drift on cloud platforms. ChatGPT, Claude, and Gemini operate behind layered alignment systems, internal stop sequences, and behavioural guardrails. Local models running via Ollama or llama.cpp expose the engine more directly.

That is both the point and the responsibility.

The fix is straightforward. Lower temperature to reduce randomness. Set a max token limit to enforce termination. Constrain output format. Insert explicit structural instructions.

These adjustments do not make the model more intelligent. They make it more bounded and lucid.
Understanding when to constrain versus when to let the model explore is what separates someone running LLMs from someone designing systems around them.

 

Practical Commands for Managing Drift

Drift is a configuration problem as much as a model problem. The tools already provide the controls. You just need to use them.

Ollama

Parameters can be adjusted interactively within a session using /set, or made permanent via a Modelfile.

#Start a model
ollama run llama3.1:8b
#Within the session, lower temperature for more focused responses
/set parameter temperature 0.3
#Set maximum output length to prevent runaway generation
/set parameter num_predict 500
#Nucleus sampling
/set parameter top_p 0.9

To make parameters permanent across sessions, create a Modelfile:

FROM llama3.1:8b
PARAMETER temperature 0.3
PARAMETER num_predict 500
PARAMETER top_p 0.9

Then build and run the custom model:

ollama create llama3.1:8b-controlled -f Modelfile
ollama run llama3.1:8b-controlled
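The same controls are available when scripting against Ollama’s local API, which matters once these models sit inside a workflow rather than an interactive session. A minimal Python sketch, using the same illustrative values:

# Passing sampling constraints programmatically via Ollama's local API.
# Equivalent in intent to the /set commands and Modelfile above.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "List three risks of unconstrained text generation.",
        "stream": False,
        "options": {
            "temperature": 0.3,   # lower randomness
            "num_predict": 500,   # hard cap on output length
            "top_p": 0.9,         # nucleus sampling
        },
    },
    timeout=600,
)
print(resp.json()["response"])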

 

llama.cpp

If you’re running llama.cpp directly, parameters are passed as CLI flags at launch. This is where the lizard came from.

# Lower temperature for more focused responses
./llama-cli -m model.gguf --temp 0.3
# Set maximum output length
./llama-cli -m model.gguf -n 500
# Combine multiple parameters
./llama-cli -m model.gguf --temp 0.3 --top-p 0.9 -n 500

 

Local Agents: Beyond Chatbots

The OMEN’s fast inference enables something more interesting than simple question-and-answer interactions: locally contained agents.

Recent events in the AI agent space have illustrated what happens when an autonomous system is granted broad permissions and asked to interpret loosely defined instructions. When deletion privileges, write access, and automation combine without sufficient containment, mistakes scale quickly. Those failures are architectural, not cognitive.

The distinction is not cloud versus local. It is unconstrained authority versus explicit boundaries.

If you followed the Moltbot (aka Clawdbot) incident, you’ve seen what happens when containment is an afterthought. A decade of emails gone in the blink of an eye. Efficient, yes. Recoverable, no.

A locally contained agent operates within boundaries you define. It cannot exceed the permissions of the operating system account you run it under. It cannot reach beyond your network unless explicitly configured to do so.

Containment is not about intelligence. It is about scope.

A local document agent can read, summarise, flag, and draft. It cannot irreversibly delete, transmit, or execute beyond the constraints you impose. The worst case is a response you discard, not a system-wide action you regret.
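A minimal sketch of what such a contained agent can look like, assuming a local Ollama instance and a folder of text files. It has read-only access by construction: it opens files, asks a local model to summarise and flag them, and prints the result. Nothing in it can delete, transmit, or modify anything; the directory path and model tag are placeholders.

# Read-only local document agent sketch: traverse a folder, summarise and
# flag each text file via a local model, print findings. No write, delete
# or network calls beyond localhost. Path and model tag are illustrative.
from pathlib import Path
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:8b"
DOCS = Path("./documents")  # placeholder directory

def summarise(text: str) -> str:
    prompt = (
        "Summarise this document in three bullet points and flag anything "
        "that looks like confidential client data:\n\n" + text[:8000]
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

for doc in sorted(DOCS.glob("*.txt")):
    print(f"=== {doc.name} ===")
    print(summarise(doc.read_text(encoding="utf-8", errors="ignore")))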

Beyond safety, there is performance.

Low-latency inference on Blackwell-class hardware enables agents that make dozens of rapid internal decisions: traversing file systems, analysing code, chaining multi-step workflows. Each step occurs locally. There is no network round trip, no API latency, no rate limits.

Over workflows involving hundreds of decisions, those milliseconds compound.

A developer using a local coding agent to refactor a codebase, for example, can read files, identify patterns, propose changes, and write updated code without external calls. The codebase remains local. Mistakes are reversible. Version control stays intact.

This is not merely about privacy. It is about control, reversibility, and architectural clarity.

When agents operate within boundaries you define, authority is deliberate rather than assumed. For professional use, that distinction matters.

 

What This OMEN Hardware Cannot Do

It is worth being explicit about the hard limits. This laptop is capable, but it is not magic.

Multiple large models simultaneously. Want a 30B coding model and a 20B analysis model running at the same time? Not happening. Within the 16GB VRAM envelope this system is designed for sequential workflows, not concurrent large-model inference.

Real-time video processing. Analysing video frames with LLMs requires different architecture. This GPU is optimised for inference, not computer vision workloads requiring real-time frame analysis.

Training models from scratch. Fine-tuning small models is possible. Training a 7B model from foundation weights is technically feasible but painfully slow. Training anything 30B or above requires data centre GPUs with NVLink, pooled VRAM, and proper cooling infrastructure. If you’re thinking you’ll just train your own GPT-4 competitor, I have some disappointing news.

Guaranteed sub-second responses on 70B models. Physics matters. If your workflow depends on instant responses from the largest models, you need cloud APIs or hardware with 48GB or more of VRAM.

All-day battery life under inference load. Under sustained LLM workload, you’re looking at 60-90 minutes, maybe. This is a plugged-in workstation, not a coffee shop laptop.

These aren’t complaints. They’re boundaries. Understanding them prevents disappointment and helps you architect workflows that match hardware reality.

 

The Portability Question: Battery, Weight, and Reality

The OMEN MAX 16 is technically a laptop. Whether it is portable depends on how honest you are about how you work.

Battery life varies significantly depending on workload.

Light document work and browsing: six to seven hours. Genuinely usable for a half-day away from a desk.

Sustained LLM inference with the GPU under load: sixty to ninety minutes in performance mode at full brightness. If you are running 32B or 70B models, plan to be near power.

Unplugged inference performance was not formally benchmarked. The OMEN is a plugged-in workstation and was tested accordingly.

But the most revealing number wasn’t either of those.

A Microsoft Teams call at 70% brightness in balanced mode drained 15% in 27 minutes. That translates to roughly two and a half to three hours of video conferencing on a full charge. For a professional workstation, that’s the metric that matters. The client demo. The conference room without a power outlet. The meeting that runs long.

The 330W power brick is larger than most laptop chargers you’ve owned. Combined with the laptop, you’re at close to 4kg. This is not an ultralight machine pretending to be one, and it doesn’t pretend to be.

Fan noise deserves a precise clarification.

During initial benchmarking, the laptop was placed on a glass table with unobstructed airflow. Under those conditions, it remained surprisingly unobtrusive. Later testing on my lap, partially restricting airflow, triggered sustained maximum fan speeds.

Measured peak noise was approximately 66dB. Clearly audible and intrusive in quiet environments.

Notably, the chassis remained cool throughout. The thermal priority is clear: performance stability over acoustic discretion.

This is a portable workstation in the same way high-end gaming laptops are portable. It moves between locations and performs at full capability once there. It works best plugged in.

If all-day untethered mobility is central to your workflow, Apple Silicon remains unmatched on that metric. If your priority is local AI performance at this scale, and you are comfortable working within reach of power, the trade-off is rational.

 

The Comparison Question: What About Apple Silicon?

This is the question every local AI discussion eventually reaches. What about a MacBook Pro with M4 Max and 128GB unified memory?

Honest answer: I haven’t tested one yet. But we can compare architectures and published behaviour.

 

Where Apple Silicon Excels

Capacity.

128GB of unified memory allows models that simply will not fit on a 16GB VRAM GPU. A 120B model at Q4 quantisation is feasible on a maxed-out Mac. It is not feasible here.

Battery life.

Apple’s efficiency per watt remains exceptional. If your priority is sustained AI workloads unplugged for hours at a time, Apple Silicon is materially ahead.

Larger context windows.

More available memory allows larger KV caches, which means longer conversation histories before memory pressure becomes an issue.

 

Where NVIDIA Blackwell Excels

Raw inference throughput.

CUDA-optimised inference on Blackwell-class GPUs delivers extremely strong token generation performance in the 8–34B range. On models that fit comfortably within VRAM, dedicated GPU memory and mature CUDA tooling typically produce higher generation speeds than unified memory architectures optimised for efficiency. As NVFP4 support matures, that headroom increases further.

Software ecosystem.

CUDA remains the dominant platform in AI development. Many frameworks, fine-tuning libraries, and research tools are optimised first for NVIDIA hardware. llama.cpp’s CUDA backend is mature. vLLM does not currently support Metal. If you work deeply in AI tooling, you will eventually encounter a “CUDA-only” constraint.

Cost per usable performance tier.

The OMEN and a 128GB unified memory Mac configuration occupy similar price territory. They are different bets: dedicated GDDR7 and CUDA throughput on one side; massive addressable memory and platform coherence on the other.

Upgradeability.

The OMEN’s RAM is user-serviceable. Apple’s unified memory is fixed at purchase. Your ceiling is defined the day you buy it.

The practical reality is this: if you need to run 120B-class models regularly, or if very large context windows are central to your workflow, Apple’s unified memory architecture is uniquely capable in a portable form factor.

If your work centres on the 8–34B range and prioritises generation speed, CUDA tooling maturity, and hardware flexibility, Blackwell-class GPUs offer a compelling trade-off.

They optimise for different constraints.

When I test an M4 Max directly, I’ll replace architecture-based inference with measured data. Until then, the decision comes down to whether you prioritise model capacity or inference throughput.

For most professional local AI workflows today, the 8–34B range is where iteration speed matters most. That range favours dedicated GPU acceleration.

That hypothesis is worth testing properly, and I intend to.

 

The Desktop Question: Why Not Build a Tower?

At $6,999 AUD RRP, the natural question emerges: why not build a desktop instead? HP periodically runs specials, so it’s worth checking the HP Australia website before purchasing, but the question stands either way.

The honest answer is simple. You can build a faster machine for the same money. Whether you should depends on how you work.

For roughly the same budget, a desktop build with an RTX 5090 carrying 32GB of VRAM, or even dual RTX 5080s, plus 128GB of DDR5, superior sustained thermals and full upgrade flexibility would outperform this laptop. More VRAM. More RAM. Better cooling. Easier expansion. On raw performance per dollar, the desktop wins.

I priced it out. Twice.

What you are paying for here is not peak performance. You are paying for form factor, flexibility, and mobility. The ability to run 32B and 70B-class models from a couch, a client office, or a conference room without rebuilding your workspace around a tower and monitor stack.

There is also power efficiency to consider. A desktop-class GPU under sustained load draws substantially more than this machine. Over time, that difference shows up on an electricity bill.

If your workspace is fixed, you never need to move your workstation, and you want maximum performance per dollar, build a desktop. The value proposition is objectively stronger.

If you work across locations, demo AI capabilities to clients, travel, or simply do not want a dedicated room consumed by a tower, the portability premium becomes rational. I work from my couch with the laptop on the coffee table and the dog beside me. That workflow matters more to me than squeezing an extra 10–20% of peak performance from a tower.

For consultants, researchers moving between institutions, developers splitting time between office and home, or anyone whose AI tooling needs to travel with them, this form factor has real value. For fixed-location power users, the desktop wins on raw capability.

The choice is not about which is better. It is about which constraint defines your life.

HP OMEN MAX 16 back panel

Who Should Buy This?

This laptop makes sense if:

You work with data that cannot leave your infrastructure: legal, healthcare, defence, or IP-sensitive R&D.

You value offline capability and do not want dependence on API availability.

You are comfortable with basic technical setup. Ollama lowers the barrier significantly, but this is still not a “one-click magic” experience.

You understand the trade-offs between model size, speed, and capability.

Your workflow centres on 8–34B models, where iteration speed matters more than sheer parameter count.

You occasionally need to move your workstation between locations without sacrificing meaningful local inference performance.

This laptop does not make sense if:

Your needs are fully met by cloud AI platforms. In that case, the web interface is faster, simpler, and economically rational.

Your workflow depends on interactive 70B-class models.

You expect local AI to feel identical to cloud AI. It does not, and that difference is both the cost and the benefit.

You are unwilling to learn foundational LLM concepts like quantisation, context windows, and sampling.

All-day unplugged battery life is central to your work.

Your workstation never moves and maximum performance per dollar is your priority. In that case, a desktop build is the better tool.

 

Conclusion: The Hardware is Ready

The HP OMEN MAX 16 proves something important: local AI in 2026 is no longer theoretical. It is operational.

Within the 8B to 32B range, this machine delivers fast, stable, genuinely usable performance. It handles sustained inference without thermal instability and offloads predictably when required. It rewards users who understand context windows and quantisation rather than obscuring those mechanics.

The 70B tier runs. It does not converse fluidly. But it is viable for batch reasoning and structured workloads. That distinction matters.

This is not a laptop for curiosity. It is a laptop for intention.

If your work requires control over data, predictable performance, and the ability to experiment without API boundaries, the hardware is ready. The discipline is the variable.

If you need 120B-scale capacity, unified memory architectures are the right tool. If you need 24GB+ VRAM and never move your workstation, build a tower.

But if your workflows live in the 8B–34B band, and you value portability without surrendering meaningful capability, this machine is a credible professional instrument.

Local AI remains a discipline. It rewards people who understand what they are running and why. The OMEN does not remove that responsibility. It makes that responsibility viable on consumer hardware.

That is not cosplay.

That is infrastructure.

DRN would like to thank HP for providing the hardware for this review.

 

HP OMEN MAX 16: Key Specifications (As tested here)

 

Component | Specification
CPU | Intel Core Ultra 9 275HX
GPU | NVIDIA GeForce RTX 5080
VRAM | 16GB GDDR7
System RAM | 32GB DDR5-5600 (user-upgradeable to 64GB via SO-DIMM)
Storage | 1TB SSD
Display | 16″ WQXGA OLED (2560×1600), 48-240Hz, 0.2ms response time, 100% DCI-P3
Battery | 83 Wh, 6-cell Li-ion polymer; fast charge to 50% in approximately 30 minutes
Weight | Not listed; laptop plus 330W adapter comes to close to 4kg (see portability section)
Ports | 2x Thunderbolt 4 (USB-C 40Gbps, DisplayPort 2.1, USB PD 3.1); 2x USB-A 10Gbps; 1x HDMI 2.1; 1x RJ-45; 1x headphone/mic combo
Wireless | Wi-Fi 7 (Intel BE200), Bluetooth 5.4
Webcam | HP True Vision 1080p FHD IR camera with temporal noise reduction
Cooling | OMEN Tempest Cooling PRO; Cryo Compound liquid metal on CPU and GPU; vapour chamber
Operating System | Windows 11 Home
Price | $6,999 AUD (RRP)
LLM Models Tested | Ollama and llama.cpp; Llama 3.1 8B, Qwen 2.5 14B, Qwen 2.5 32B, Qwen3.5-35B-A3B, Llama 3.1 70B

 

Glossary of AI and Hardware Terms

Context Window

The maximum amount of text a model can consider at once. It includes the prompt, conversation history, and the model’s generated reply. Larger context windows allow models to work with longer documents or extended conversations.

CUDA

NVIDIA’s software platform for running accelerated workloads on GPUs. Many AI frameworks and inference tools are optimised for CUDA, which is why NVIDIA hardware remains common in AI development environments.

Eval Rate (tokens/sec)

The speed at which a model generates new tokens during inference. Higher values mean faster responses and more interactive model behaviour.

Inference

The process of running a trained AI model to generate a response. In this article, inference refers to running local language models on consumer hardware rather than querying a remote cloud API.

KV Cache (Key–Value Cache)

A memory structure used by transformer models to store previously processed tokens. It allows models to avoid recalculating earlier parts of a sequence but consumes additional VRAM as conversations grow longer.

LLM (Large Language Model)

A neural network trained on large amounts of text to generate or analyse language. Examples used in this review include models in the 8B–70B parameter range.

Model Parameters (B = billions)

The internal weights that define a neural network’s behaviour. Larger models (for example, 32B or 70B) typically have stronger reasoning and knowledge capabilities but require more memory and compute to run.

MoE (Mixture of Experts)

A model architecture that activates only a subset of its internal components for each token. This allows very large models to behave like smaller ones during inference while retaining the capacity of a larger system.

Nucleus Sampling (top_p)

A sampling method that limits the probability distribution from which tokens are selected. It helps balance creativity and coherence in generated responses.

NVFP4

A 4-bit floating-point format introduced with NVIDIA’s Blackwell architecture. It is designed to accelerate AI inference while maintaining higher output quality than traditional ultra-low precision formats.

Ollama

A tool that simplifies running local language models on personal hardware. It manages model downloads, execution, and configuration for local inference workflows.

Offloading

The process of moving part of a model from GPU memory to system RAM when VRAM capacity is exceeded. Offloading allows larger models to run but typically reduces performance.

Quantisation

A technique that reduces the precision of model weights (for example from 16-bit to 4-bit) to decrease memory usage and increase inference speed. Most consumer systems rely on quantised models to run larger LLMs locally.

Sampling Parameters

Settings that control how a model generates text, including temperature, top_p, and token limits. These influence randomness, creativity, and response length.

Temperature

A parameter that controls randomness in text generation. Lower values produce more predictable responses, while higher values increase variation and creativity.

Token

A unit of text processed by a language model. Tokens may represent whole words, parts of words, or punctuation. Both prompts and responses are measured in tokens.

VRAM (Video Memory)

Memory located on a GPU. It stores model weights, KV cache data, and intermediate computations during inference. VRAM capacity is often the primary constraint when running local LLMs.

Unified Memory

A memory architecture used by Apple Silicon where CPU, GPU, and AI accelerators share the same memory pool. This allows very large models to run in some cases but may trade off raw GPU throughput.