
When 'Can You?' Beats 'Do It' for AI Prompts—and When It Doesn't: The Real Axis Is 'Single Answer vs. Interpretive Space'

  • Audience: Engineers who use AI daily for code generation, debugging, and reasoning tasks (broadly applicable to other professionals)
  • Prerequisites: Basic experience with LLMs
  • Reading time: 12 minutes

Overview

When asking a human to do something, “Can you?” or “What do you think?” tends to elicit better performance than barking “Do it.” Psychology has shown this pattern repeatedly. Deci and Ryan’s Self-Determination Theory (SDT) demonstrates that controlling pressure thwarts the need for autonomy, reducing intrinsic motivation and undermining learning, creativity, and persistence [1]. For humans, the question form consistently wins across many situations.

So what happens when the recipient is AI? Intuitively, the same rule should hold. And in fact, engineers report that consultative phrasing produces “incredibly good code” from Claude/Codex compared to imperative commands [2]. OpenAI’s official GPT-5 prompting guide explicitly notes that overly forceful instructions can be counterproductive [3].

Yet when you assemble the data, the picture gets more complicated. A 2025 arXiv paper, “Mind Your Tone,” found that rude commands outperformed polite requests by 4 percentage points in accuracy—the opposite of what the human rule would predict [4]. Wharton’s large-scale experiments showed that “Please” vs. “I order you” framings caused per-question accuracy to swing by up to 60 points, while averaging out across the dataset [5]. The “question form wins” rule established for humans does not transfer cleanly to AI.

The conclusion, stated upfront: for AI, “command form vs. question form” is not the fundamental axis. The real axis is “does the task have a single correct answer, or significant interpretive space?” When the answer is unique, command form (short, with explicit constraints) tends to work. When interpretive space is large, question form (allowing the model to surface assumptions and reasoning) tends to work. Mind Your Tone’s rude commands won because the multiple-choice fact problems were single-answer tasks [4]. Coding sessions where consultation wins are interpretive-space tasks where multiple valid implementations exist [2].

Mechanistically, command form lets humans define what matters and pushes the model into Production Engine mode (optimized for reproducibility). Question form leaves interpretation to the model, activating Co-thinking System mode (optimized for transparency) [6]. It’s not about which is better, but about activating the right mode for the task. That’s the decisive difference between AI and humans. For high-stakes uses, A/B testing remains the only definitive answer—but we’ll examine the underlying axis through six primary sources.

Starting Point: Does the Human Rule Apply to AI?

For humans, “Can you?” beats “Do it” almost self-evidently. Controlling instructions can extract short-term compliance but violate the recipient’s autonomy needs and undermine intrinsic motivation—a conclusion built up across decades of SDT research [1]. Across management, education, and coaching, autonomy-supportive question-based approaches consistently improve learning outcomes and sustained performance.

Does the same law hold when the recipient is AI? Many people intuitively answer yes. Engineering practitioner reports lend support. One developer documented their experience prompting Claude/Codex for code [2]. The original example contrasts imperative phrasing like “Make a lesson plan” / “You absolutely must…” with consultative phrasing like “What do you think?” / “Could you think about it?” / “How does this look?” [2]. Reframed for code generation, the typical contrast looks like this:

Command form (reconstructed example):

```text
Write code to spec X.
You absolutely must write detailed code.
You must thoroughly cover all edge cases.
```

Consultative form (reconstructed example):

```text
Could you write code to spec X?
I'm thinking of implementing it this way—what do you think?
Anything I might be overlooking?
```

The reported subjective experience: the consultative form produced “incredibly good code, with what felt like better debugging” [2]. Practitioner explanations show similar contrasts [7].

| Form | Example |
| --- | --- |
| Command | “Generate 3 marketing strategies.” |
| Request | “Considering our company’s situation, could you propose 3 effective marketing strategies? Please include the expected effects of each.” |

You might conclude here that “yes, just like humans, AI also responds better to question/request forms.” But there is counter-evidence that does not fit the human rule.

The 2025 arXiv paper “Mind Your Tone” rewrote 50 base questions in 5 tone variants (Very Polite to Very Rude), running 250 prompts on multiple-choice problems (math, science, history) with GPT-4o [4].

| Tone | Accuracy |
| --- | --- |
| Very Polite | 80.8% |
| Very Rude | 84.8% |

The rude command form won by 4 points, with a paired-sample t-test confirming significance [4]. If you tried this with a human, they would either shrink back or push back, and performance would drop. With AI, the opposite happens.

The two findings look contradictory—but with a shift in perspective, they aren’t. AI just operates by different rules than humans.

The Real Axis: Single Answer vs. Interpretive Space

Both pieces of evidence hold simultaneously because the task properties differ.

  • The coding case [2] — interpretive space is large. The same spec admits many valid implementations.
  • The Mind Your Tone experiment — single answer. Multiple-choice math/science/history problems have no “multiple valid answers.”

This difference is exactly what determines how command and question forms affect the output.

In a 2026 Substack essay, Donald Ng frames this as a problem of cognitive architecture [6]:

> Small language choices don’t change what a model knows. They change: how much interpretation it must perform, how visible its reasoning becomes, who is responsible for defining what matters

The point is that command and question forms change how much interpretive responsibility passes to the model.

```mermaid
flowchart TB
    Q["User input"]
    Q --> CMD["Command form<br>'Summarize this'"]
    Q --> QST["Question form<br>'Could you summarize?'"]

    CMD --> CMD_R["Human defines 'what matters'<br>Model executes constraints"]
    QST --> QST_R["Model gets interpretive room<br>Surfaces assumptions/reasoning"]

    CMD_R --> P["Production Engine<br>Optimizes for reproducibility"]
    QST_R --> CT["Co-thinking System<br>Optimizes for transparency"]
```

Command form runs the model in Production Engine mode—optimized for efficiency and reproducibility [6]. For single-answer tasks, this is the correct mode. It explains why rude commands won on multiple-choice problems: less interpretive overhead, and the model can drive straight at the answer.

Question form runs the model in Co-thinking System mode—optimized for transparency and collaborative thinking [6]. With question phrasing, models tend to produce “more explanatory framing, more surfaced assumptions, more visible reasoning” [6]. For interpretive-space tasks—design, debugging hypotheses, code review—this is the correct mode.

Note that “politeness” or “courtesy” is not the underlying mechanism. As practitioner explanations point out, request forms work because information density increases [7]. The difference between “Generate 3 marketing strategies” and “Considering our company’s situation, could you propose 3 effective marketing strategies? Please include expected effects” comes from added evaluation criteria and context, not politeness.

Three Independent Variables

We can now see that prompts have at least three independent variables:

| Axis | Content | Primary effect |
| --- | --- | --- |
| Task axis | Single answer / interpretive space | Determines which mode to activate |
| Form axis | Command / question | How interpretive responsibility is passed |
| Forcefulness axis | Intensifiers like “thoroughly,” “absolutely” | Distorts model attention allocation |

Debating only “command vs. question” produces wobbly conclusions because these variables get tangled. The right order is: decide the task axis first, choose the form to match, and treat forcefulness as a separate problem.
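
To make that ordering concrete, here is a minimal Python sketch of the decision procedure. Everything in it (`TaskKind`, `build_prompt`, the intensifier list) is an illustrative name of ours, not part of any library or the cited sources:

```python
from enum import Enum

class TaskKind(Enum):
    SINGLE_ANSWER = "single_answer"  # format conversion, extraction, fact lookup
    INTERPRETIVE = "interpretive"    # design, debugging hypotheses, review

# Forcefulness is its own axis: strip intensifiers regardless of form.
INTENSIFIERS = ("thoroughly", "absolutely", "completely")

def build_prompt(task: str, kind: TaskKind) -> str:
    """Step 1: decide the task axis. Step 2: match the form. Step 3: restrain forcefulness."""
    if kind is TaskKind.SINGLE_ANSWER:
        # Command form with an explicit constraint, no intensifiers.
        prompt = f"{task} Return only the answer."
    else:
        # Question form that invites assumptions and alternatives to surface.
        prompt = f"{task} How would you approach this, and what might I be overlooking?"
    for word in INTENSIFIERS:
        prompt = prompt.replace(word, "").replace(word.capitalize(), "")
    return " ".join(prompt.split())  # tidy double spaces left by removal

print(build_prompt("Convert this CSV row to JSON.", TaskKind.SINGLE_ANSWER))
print(build_prompt("I want to add caching to this API.", TaskKind.INTERPRETIVE))
```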

OpenAI’s GPT-5 prompting guide itself issues an independent warning about forcefulness [3]. In a co-developed prompt-tuning case with Cursor, early commands tried to extract thorough tool use with phrasing like:

```text
Be THOROUGH when gathering information.
Make sure you have the FULL picture before replying.
```

But because GPT-5 is already naturally introspective, this command form proved counterproductive [3]. The fix softened the strong commands, replacing them with guiding phrasing like “Bias towards not asking the user for help if you can find the answer yourself,” which improved results [3].

The key insight: the problem isn’t “command form” but “excessive forcefulness.” Command form itself works fine on single-answer tasks. The issue is intensifiers like “thoroughly,” “absolutely,” “completely” skewing model attention. What won in Mind Your Tone was “Very Rude,” not “Very Forceful”—a separate axis.

The Remaining Noise: Wharton’s “60 Points Per Question, Cancels Out on Average”

The task axis explains a lot, but not everything.

Meincke et al. of Wharton (UPenn) evaluated GPT-4o and GPT-4o-mini on the difficult GPQA Diamond dataset in their 2025 technical report [5]. Running each question 100 times and comparing prompt variations including “Please” and “I order you,” they found:

  • Per-question accuracy swings of up to 60 points between “Please” and “I order you”
  • But these differences cancel out on average across the dataset [5]

GPQA Diamond is a single-answer task, so a naive “single answer → command form wins” prediction would expect command form to win on average. But the differences cancel out—meaning per-question × prompt combinations carry noise that the task axis can’t explain.

The report’s title says it all: “Prompt Engineering is Complicated and Contingent.”

The practical implication is clear. Use the task axis as a guide, but for high-stakes uses, run A/B tests. The variance is too large to leave to heuristics alone.
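
A minimal sketch of such an A/B test, assuming the OpenAI Python SDK with an `OPENAI_API_KEY` in the environment; the model name, task, run count, and substring grading are placeholders to swap for your own:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def accuracy(prompt: str, expected: str, runs: int = 20, model: str = "gpt-4o") -> float:
    """Run one prompt variant repeatedly and score exact-substring matches."""
    hits = 0
    for _ in range(runs):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        hits += expected in (resp.choices[0].message.content or "")
    return hits / runs

# Same single-answer task in two forms (illustrative; 0x2F == 47).
command_form = "Convert 0x2F to decimal. Answer with the number only."
question_form = "Could you convert 0x2F to decimal for me?"

print("command :", accuracy(command_form, "47"))
print("question:", accuracy(question_form, "47"))
```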

Practical Guide for Engineers

How does this translate into daily AI use?

```mermaid
flowchart TB
    START["Want AI to do something"]
    START --> Q1{"Single<br>correct answer?"}
    Q1 -->|Yes| CMD["Command-based<br>e.g., 'Write a function with this signature'<br>'Return JSON'<br>Multiple-choice fact problems"]
    Q1 -->|Interpretive space| QST["Question-based<br>e.g., 'How would you implement this?'<br>'Other ideas?'<br>'What might I be missing?'"]
    CMD --> SOFT{"Strong intensifiers?"}
    QST --> SOFT
    SOFT -->|"'Thoroughly' / 'Absolutely'"| WARN["Counterproductive risk<br>on latest models"]
    SOFT -->|Restrained| ABTEST["A/B test for<br>high-stakes uses"]
    WARN --> ABTEST
```

When command form works (single-answer tasks)

  • Code generation with clear specs — “Write a function with this type signature,” “Return in this JSON schema”
  • Routine transformations and extractions — Format conversion, data extraction, fixed-template summarization
  • Agents and production automation — Pipelines run as Production Engine
  • Queries with a single factual answer — Single-answer search like Mind Your Tone’s multiple-choice problems

But keep forcefulness restrained. As OpenAI demonstrates, “thoroughly” and “absolutely” can be counterproductive on the latest models [3].
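
As one illustration of “short, with explicit constraints,” here is a command-form prompt for an extraction task, built in Python. The schema, field names, and sample text are ours, not from the cited sources:

```python
import json

# Explicit constraints (schema, output shape) do the work that intensifiers can't.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "price_usd": {"type": "number"}},
    "required": ["name", "price_usd"],
}

prompt = (
    "Extract the product name and price from the text below. "
    f"Return one JSON object matching this schema: {json.dumps(schema)}. "
    "Output the JSON only, with no surrounding prose.\n\n"
    "Text: The Model K keyboard retails for $129."
)
print(prompt)
```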

When question form works (interpretive-space tasks)

  • Expanding design options — “Other ideas for this design?” “What are the tradeoffs?”
  • Debugging hypothesis generation — “Why do you think this is happening?” “What could cause this?”
  • Code review angle generation — “What might I be overlooking?”
  • Requirement and spec refinement — “What’s missing from this spec?”

This activates co-thinking mode, naturally encouraging the model to surface assumptions and rationale [6].

Common prerequisite

Before choosing form, secure information density [7]. The marketing strategy example above wins as a request form not from politeness but from added evaluation criteria (“effective,” “expected effects”) and context (“our situation”). Whether you use command or question form, making evaluation criteria explicit helps independently.

And A/B test for high-stakes uses. As Wharton showed, the same question can vary by 60 points in accuracy depending on prompt form [5]. The task axis is just an initial hypothesis; let measurement decide the final call.

Shelf Life: When the Model Changes, So Does Which Axis Works

Everything above assumes today’s main models (GPT-4o, GPT-5, Claude). But this knowledge has a shelf life tied to model generation.

Specifically, the evidence depends on particular models:

| Finding | Model tested |
| --- | --- |
| Very Rude > Very Polite (4-point gap) [4] | GPT-4o |
| 60-point per-question swings, average cancellation [5] | GPT-4o / GPT-4o-mini |
| “Be THOROUGH” counterproductive [3] | GPT-5 |

OpenAI itself shows in the Cursor case that strong command forms that may have worked pre-GPT-4 became counterproductive on GPT-5 because it is “naturally introspective” [3]. In other words, the same prompt’s effect can flip across model generations.

In future generations, perhaps “robustness to forcefulness improves so strong commands don’t break things,” or perhaps “co-thinking becomes default and command forms surface assumptions just like question forms.” The “single answer / interpretive space” axis itself might fade as model reasoning improves.

A practical posture:

  • Treat the guidance here as a “2026 initial hypothesis” — Don’t expect it to remain valid years from now
  • Re-measure on your own tasks when new models drop — Don’t carry prior heuristics forward unchanged
  • Always check the model provider’s official guide — As OpenAI showed with the Cursor case, providers themselves flag generational differences
  • Keep a small evaluation set on hand — A 10-question battery on key tasks lets you check axis validity quickly across models (see the sketch below)
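
A sketch of such a battery, again assuming the OpenAI Python SDK; `eval_set.json`, the substring grading, and the model names are placeholders to adapt:

```python
import json
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, model: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content or ""

def run_battery(path: str, models: list[str]) -> None:
    """Score a fixed (prompt, expected) battery against each model generation."""
    with open(path, encoding="utf-8") as f:
        cases = json.load(f)  # e.g. [{"prompt": "...", "expected": "..."}, ...]
    for model in models:
        score = sum(c["expected"] in ask(c["prompt"], model) for c in cases)
        print(f"{model}: {score}/{len(cases)}")

run_battery("eval_set.json", ["gpt-4o", "gpt-5"])  # placeholder model names
```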

In the long run, the cheaper path is to avoid fixing “this is the answer” in place and to re-question the axis itself with each generational turnover.

Summary: Answering “Does What Works for Humans Work for AI?”

Back to the opening question—“Question form consistently wins for humans. Does the same hold for AI?”

The answer is “partly yes, partly no.”

| Recipient | Effective form |
| --- | --- |
| Humans | Question form wins in most contexts (autonomy support sustains intrinsic motivation) [1] |
| AI | Task-dependent—command for single-answer, question for interpretive space |

For humans, the simple rule “use question form” handles most situations. For AI, you need one extra step of judgment—identify whether the task is single-answer or interpretive-space, and match the form.

The guidance from six sources, summarized:

  • For AI, “command vs. question” is not the fundamental axis. The real axis is “single answer or interpretive space”
  • Single answer → command form (Mind Your Tone showed rude commands won on multiple-choice [4])
  • Interpretive space → question form (coding, design, debugging activate co-thinking [2][6])
  • Forcefulness is a separate axis — On the latest models, “thoroughly” can backfire [3]
  • The task axis explains a lot but noise remains — A/B test for high-stakes uses [5]
  • When the model generation changes, so does which axis works — All the above has a shelf life

A prompt is not an “instruction” but a “dialogue design.” But the rules of dialogue fundamentally differ depending on whether the recipient is AI or human, and they update with each model generation. Whether you bring a human-established rule to AI, or the reverse, pause to ask “which set of rules is the recipient running on?”—that’s the posture that keeps working over time.

References

  1. Self-determination theory and the facilitation of intrinsic motivation, social development, and well-being - Ryan & Deci (2000), American Psychologist, 55(1), 68-78. DOI: 10.1037/0003-066X.55.1.68. The theoretical framework establishing that autonomy, competence, and relatedness are the three basic psychological needs supporting intrinsic motivation. Also: Self-determination theory and work motivation - Gagné & Deci (2005), Journal of Organizational Behavior, 26(4), 331-362. DOI: 10.1002/job.322. Empirically demonstrates that external commanding pressure reduces intrinsic motivation in work contexts. 【Reliability: High (peer-reviewed, highly cited)】

  2. Talk to generative AI not with “commands” but with “how does this look?” - Keisuke (2025, in Japanese). Practitioner report on coding with Claude/Codex. Cites the GPT-5 Prompting Guide and advocates shifting away from command forms. 【Reliability: Medium (practitioner blog, with specific examples)】

  3. GPT-5 Prompting Guide - OpenAI (2025). The Cursor case shows that command forms with strong intensifiers like “Be THOROUGH” became counterproductive. Results improved by “softening the language around thoroughness.” 【Reliability: High (official guide from model provider)】

  4. Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy - Dobariya & Kumar (2025). 50 questions × 5 tones = 250 prompts compared on GPT-4o. Very Polite 80.8% vs. Very Rude 84.8%. Significance confirmed by paired t-test. 【Reliability: Medium (preprint, small sample, limited to specific model and task)】

  5. Prompt Engineering is Complicated and Contingent - Meincke et al., Wharton School (2025). GPT-4o/GPT-4o-mini × GPQA Diamond × 100 repetitions per question. Please vs. I order you produces per-question swings of up to 60 points but cancels out on average. 【Reliability: Medium-High (university research institute technical report, large repetition counts)】

  6. Polite vs Command Prompts in LLMs: How Wording Changes AI Responses - Donald Ng (2026). Theoretical framing of command and question forms as activating different cognitive architectures (Production Engine vs. Co-thinking System). 【Reliability: Medium (expert blog, theoretical synthesis)】

  7. How to dramatically improve generative AI accuracy: writing prompts for the answers you want - Microwave Creative (2025, in Japanese). Contrast between command and request forms; the information-density perspective. 【Reliability: Medium (practitioner-oriented media)】

This post is licensed under CC BY 4.0 by the author.