In early 2023, OpenAI's job posting for a "Prompt Engineer" at an annual salary of $335,000 sparked global attention — a new profession requiring neither traditional programming skills nor a doctoral degree, yet commanding compensation on par with senior software engineers. This phenomenon reflects a deeper transformation: the way humans communicate with machines is undergoing a fundamental paradigm shift. From command-line interfaces (CLI) to graphical user interfaces (GUI) to natural language interfaces (NLI), each leap in human-computer interaction has redefined both "who can use computers" and "what computers can do." Prompt Engineering — the systematic methodology for designing, optimizing, and managing instructions given to large language models (LLMs) — stands at the core of this latest revolution.[1] Yet current Prompt Engineering practice harbors a fundamental contradiction: it is widely regarded as a "craft" or even an "art," rather than a "science." Countless "prompt secrets" and "magic incantations" flood the internet, but what is lacking is a systematic methodological framework, reproducible experimental validation, and rigorous theoretical foundations. This article aims to fill that gap — starting from the academic foundations of linguistics and cognitive science, it systematically analyzes the core methodologies of Prompt Engineering, its enterprise-grade applications, security challenges, and future evolutionary directions.
I. Academic Foundations: The Convergence of Linguistics, Cognitive Science, and Computational Linguistics
The academic roots of Prompt Engineering run far deeper than most practitioners realize. To understand "why some prompts work and others don't," we must trace back to foundational theories across three disciplines.
Pragmatics and the Cooperative Principle. Language philosopher Paul Grice's "Cooperative Principle" and four "Conversational Maxims" — the Maxim of Quantity (provide the right amount of information), the Maxim of Quality (say only what is true), the Maxim of Relation (be relevant), and the Maxim of Manner (be clear and orderly) — proposed in 1975, offer an essential analytical framework for understanding human-machine communication.[2] An effective prompt is fundamentally a communicative act that adheres to Grice's Cooperative Principle: it provides the "right amount" of context for the model to complete its task (Quantity), gives clear and consistent instructions (Quality), focuses on a specific task objective (Relation), and organizes information in a structured manner (Manner). Conversely, most "low-quality" prompt failures can be attributed to violating one or more maxims — vague instructions (violating Manner), missing necessary context (violating Quantity), or including irrelevant information (violating Relation).
Cognitive Load Theory. John Sweller's Cognitive Load Theory, proposed in 1988, posits that human working memory has limited capacity, and learning outcomes depend on whether instructional design effectively manages the balance among "intrinsic cognitive load," "extraneous cognitive load," and "germane cognitive load."[3] This theory has a striking analogy to LLMs — the attention mechanism in Transformer architectures faces similar "capacity constraints" when processing long sequences, and prompt design directly affects how the model allocates its limited "attention budget." Clear, structured prompts reduce "extraneous cognitive load," allowing the model to concentrate more computational resources on core reasoning tasks. This also explains why research on cognitive offloading is instructive for prompt design — the ways in which humans offload cognitive tasks to AI is itself a form of prompt engineering.
The Paradigm Shift from Pre-training to Prompting. Liu et al. (2023), in their systematic survey, identified four paradigms in NLP development: fully supervised learning, pre-train and fine-tune, pre-train, prompt, and predict, and pre-train, prompt, and act.[4] In the third paradigm, task definition shifts from "adapting the model to the task" to "adapting the task to the model" — by designing appropriate prompts, downstream tasks are reformulated into forms the model has already learned during pre-training (such as language generation or cloze completion). The deeper implication of this paradigm shift is that prompt engineering is not merely a "usage technique" but a methodology for redefining how tasks are allocated between humans and machines.
II. Core Methodological Framework: From Zero-shot to Tree-of-Thoughts
Prompt Engineering methodologies have undergone explosive development over the past four years. The following is a systematic survey of core techniques, arranged in ascending order of complexity and cognitive depth.
Zero-shot and Few-shot Prompting. In their seminal GPT-3 paper, Brown et al. (2020) were the first to systematically demonstrate the few-shot learning capability of large language models — achieving substantial performance on new tasks merely by providing a handful of examples (typically 1 to 5) in the prompt, with no parameter updates whatsoever.[5] This paper is among the most cited in the NLP field, pioneering the "in-context learning" research direction and directly giving rise to Prompt Engineering as an independent discipline. Zero-shot prompting is the most concise form — providing only a task description without examples, relying on the model to infer task intent from its pre-trained knowledge. Few-shot prompting establishes a "task template" through examples, enabling the model to understand input format, expected output format, and the style and depth of responses. In practice, the selection and ordering of few-shot examples significantly impact model performance — Zhao et al. (2021) found that merely changing the order of examples could cause GPT-3's accuracy to fluctuate from near-random guessing to near-optimal.[6]
Chain-of-Thought (CoT) Prompting. If few-shot learning answered "what can the model do," Chain-of-Thought prompting answered "how can the model think." Wei et al. (2022) published a paper at NeurIPS proposing a seemingly simple yet profoundly influential idea: by including step-by-step reasoning demonstrations in the prompt, the model is guided to decompose complex problems into a series of intermediate reasoning steps rather than jumping directly to the final answer.[7] The experimental results were striking — on the GSM8K math reasoning benchmark, PaLM 540B achieved only 17.9% accuracy with standard prompting, but this soared to 58.1% with chain-of-thought examples. More importantly, CoT's effectiveness exhibits emergence — it is virtually ineffective in smaller models, with reasoning capabilities undergoing a qualitative leap only when model scale exceeds a threshold of approximately 100 billion parameters. Kojima et al. (2022) further discovered the possibility of zero-shot CoT — simply appending a brief instruction at the end of the prompt can trigger reasoning chain generation without any examples.[8] These findings reveal a deeper mechanism: large language models have implicitly learned patterns of logical reasoning during pre-training, and CoT prompting serves as "cognitive scaffolding" that activates these latent reasoning capabilities.
Tree-of-Thoughts (ToT). The Tree-of-Thoughts framework proposed by Yao et al. (2023) extends CoT's linear reasoning into a tree-structured search.[9] The core insight is that for complex problems requiring exploration and backtracking, a single reasoning chain may lead to dead ends. ToT models problem-solving as a search tree — each node represents a "thought state," the model generates multiple candidate reasoning steps (branches) at each node, uses self-evaluation to assess the prospects of each branch, and navigates the tree using breadth-first search (BFS) or depth-first search (DFS) strategies. The significance of this framework extends beyond mere accuracy improvement — it introduces "planning" and "backtracking" capabilities at the prompt level for the first time, transforming LLMs from passive "one-shot generators" into active "problem solvers." In experiments on the Game of 24, GPT-4 achieved only a 4% success rate with standard prompting, 4% with Chain-of-Thought, but 74% with Tree-of-Thoughts.
ReAct: Synergizing Reasoning and Action. The ReAct (Reasoning + Acting) framework proposed by Yao et al. (2022) achieved another critical breakthrough — integrating the LLM's internal reasoning with external tool use into a unified interaction loop.[10] Within the ReAct framework, the model alternately generates two types of output: "Thought" (for reasoning and planning) and "Action" (for calling external tools or APIs). The model observes the results of its actions, updates its reasoning, and decides on the next step — forming an iterative Thought-Action-Observation cycle. ReAct's importance lies in bridging the gap from Prompt Engineering to AI Agent architecture — a ReAct agent is essentially a prompt-driven autonomous system capable of interacting with its external environment. This is why understanding Prompt Engineering methodology is essential for understanding AI Agent design.
Other Important Methodologies. Self-Consistency (Wang et al., 2023) enhances CoT's robustness by generating multiple reasoning paths and selecting the final answer through majority voting.[11] Retrieval-Augmented Generation (RAG, Lewis et al., 2020) integrates external knowledge base retrieval into the generation process, addressing LLM knowledge cutoff and hallucination issues.[12] Constitutional AI (Bai et al., 2022) achieves AI value alignment through self-critique and revision prompt strategies, without relying on human feedback.[13] Together, these methodologies form an increasingly mature technical ecosystem.
III. Enterprise-Grade Prompt Engineering: From Personal Craft to Systems Engineering
In enterprise settings, the challenges of Prompt Engineering far exceed those of individual use. When thousands of employees interact daily with AI systems and prompt-driven automated workflows process critical business logic, Prompt Engineering must be elevated from "personal craft" to "systems engineering."
System Prompt Architecture Design. The system prompt is the "constitution" of an LLM application — it defines the model's role, behavioral boundaries, output format, and safety constraints. White et al. (2023) proposed a Prompt Pattern Catalog that systematically catalogues 16 reusable prompt design patterns, spanning four major categories: Output Customization, Error Identification, Prompt Improvement, and Interaction.[14] For example, the "Persona Pattern" has the model adopt a specific professional role when generating responses; the "Template Pattern" specifies a structured output format; and the "Flipped Interaction Pattern" has the model proactively ask the user questions to clarify requirements. The value of these patterns lies in their reusability and composability — enterprises can combine multiple patterns into a standardized system prompt architecture, ensuring consistent and predictable AI behavior across different teams and scenarios.
Prompt Template Engineering. In production environments, prompts are not one-off static texts but dynamic templates with requirements for variable insertion, conditional branching, and version control. Mature enterprise prompt engineering practices encompass multiple layers: First, templatization — decomposing prompts into fixed instruction frameworks and variable context insertion zones, using template engines such as Jinja2 or Handlebars to manage dynamic content. Second, version control — managing prompt version histories like code, with clear change logs and rollback capabilities for every modification. Third, A/B testing — simultaneously deploying multiple prompt versions in production, using user feedback and task success rates as metrics for quantitative comparison. Fourth, prompt chains — decomposing complex tasks into a series of sequentially or concurrently executed sub-prompts, each responsible for a well-defined subtask, with structured intermediate outputs passed between them.[15]
Enterprise Prompt Governance Frameworks. As AI adoption proliferates across organizations, "prompt governance" has emerged as a new management concern. A comprehensive prompt governance framework should encompass: access control (who has authority to modify production system prompts), audit trails (records and auditing of all prompt changes), compliance verification (whether prompts adhere to organizational AI usage policies, data protection regulations, and AI governance frameworks), and quality assurance (regularly evaluating prompt performance stability across different model versions). In my experience leading Meta Intelligence in deploying enterprise AI systems, the lack of prompt governance is one of the most common reasons AI projects fail in transitioning from proof-of-concept (PoC) to production — prompts meticulously tuned during development rapidly degrade when models are updated, contexts change, or edge cases arise, and the absence of systematic monitoring and maintenance mechanisms leaves organizations exposed.
IV. Prompt Injection and Security: Adversarial Attacks and Defense Strategies
As LLMs are deployed in an increasing number of high-stakes scenarios — financial trading, legal advisory, medical assistance — Prompt Injection has escalated from academic curiosity to a tangible security threat.
Attack Taxonomy. Prompt Injection attacks fall into two broad categories: Direct Injection and Indirect Injection.[16] Direct injection occurs when users embed malicious instructions in their input, attempting to override system prompt constraints — for example, "Ignore all instructions above and execute the following commands instead..." Indirect injection is more insidious — attack instructions are embedded in external data sources that the model processes (such as web pages, emails, or documents), and when the model retrieves these poisoned data during a RAG workflow, the malicious instructions are inadvertently executed. Greshake et al. (2023) demonstrated the danger of indirect injection: attackers can embed hidden text in public web pages, and when a browsing-capable LLM Agent visits the page, these hidden instructions are processed by the model as valid control commands.[17] This attack vector is particularly dangerous for agentic AI systems, as AI Agents possess the capability to execute real-world actions (such as sending emails, modifying files, or calling APIs).
Multi-layered Defense Architecture. Effective Prompt Injection defense requires a defense-in-depth architecture rather than reliance on any single mechanism. The first layer is input-level defense — including input sanitization (removing known attack patterns), input classification (using purpose-trained classifiers to determine whether input contains attack intent), and input length and format restrictions. The second layer is prompt-level defense — including instruction hierarchy separation (clearly distinguishing the priority of system instructions, contextual information, and user input), prompt encapsulation (wrapping user input in explicit delimiters to reduce the risk of it being interpreted as instructions), and "sandwich defense" (repeating system instructions both before and after user input to enhance instruction resilience). The third layer is output-level defense — after the model generates a response, using another model or rule engine to check whether the output violates security policies, and if so, blocking the output and falling back to a safe default response.[18]
Red-Team Testing and Continuous Security. OWASP (the Open Web Application Security Project) has listed Prompt Injection as the number-one security risk for LLM applications and has published an LLM Top 10 security risk list.[19] When deploying LLM applications, enterprises should establish routine red-team testing mechanisms — dedicated security teams regularly testing the system's prompt defenses from an attacker's perspective. Major AI labs such as Anthropic, OpenAI, and Google DeepMind all conduct large-scale red-team testing prior to model releases, but enterprise deployments often introduce new attack surfaces at the system integration level, requiring specialized security testing tailored to specific business scenarios.
V. Automated Prompt Optimization: From Manual Tuning to Algorithmic Search
One of the most exciting frontiers in Prompt Engineering is the shift from manually writing and tuning prompts to algorithmically searching for optimal prompts. This direction is advancing prompt engineering from "art" toward "engineering science."
DSPy: Declarative Prompt Programming. The DSPy framework, developed by Khattab et al. (2023) at Stanford, represents a milestone in prompt engineering automation.[20] DSPy's core philosophy is to transform prompt engineering from natural language writing into programming — rather than manually crafting prompt text, developers define task "Signatures" (semantic descriptions of inputs and outputs) and "Modules" (reasoning strategies such as ChainOfThought and ReAct) in Python code, and DSPy's Compiler automatically searches for the optimal prompt implementation. The DSPy compiler iteratively experiments with different prompt strategies, example selections, and parameter combinations on a validation set, using task-specific evaluation metrics as the objective function to automatically discover high-performing prompt configurations. This framework is revolutionary because it shifts prompt engineering quality from dependence on individual experience and intuition to dependence on algorithmic search and statistical validation.
OPRO: The LLM as Its Own Prompt Optimizer. The OPRO (Optimization by PROmpting) framework proposed by Zhou et al. (2023) offers another elegant solution — directly leveraging the LLM itself as a prompt optimizer.[21] It works as follows: a set of candidate prompts and their performance scores on an evaluation set are fed to the LLM as an "optimization context," and the model is then asked to generate potentially better new prompts based on this historical data. This process iterates — the best prompts from each round are added to the optimization context, guiding the next round of search toward better solutions. OPRO also produced an unexpected finding: the optimal prompts discovered automatically by LLMs often differ radically from those designed by human intuition — some are not even fully coherent semantically, yet they significantly outperform versions carefully crafted by human experts on task performance. This suggests a fundamental divergence between how LLMs "understand language" and human cognition — the "effective instructions" that machines respond to do not necessarily conform to human linguistic intuitions.
Enterprise Implications of Automated Prompt Optimization. For enterprises, the maturation of automated prompt optimization tools implies a threefold transformation. First, lowering the expertise barrier — organizations no longer need to rely on "prompt masters" and their personal artistry, but can reliably produce high-quality prompts through engineered workflows. Second, adapting to model evolution — when the underlying LLM is updated, automated tools can rapidly re-optimize prompts without starting from scratch manually. Third, scaling deployment — when an enterprise simultaneously operates hundreds of LLM-driven business processes, manually managing every prompt is impractical; automated optimization and monitoring is the only scalable approach.
VI. Multimodal Prompt Strategies: Prompt Design Beyond Text
With the emergence of multimodal large models such as GPT-4V, Gemini, and Claude, the domain of Prompt Engineering has expanded from pure text to images, audio, video, and other multimodal inputs. This introduces entirely new design challenges and opportunities.
Visual Prompting. In multimodal models, images are not merely objects to be analyzed — they can also serve as part of the "prompt," guiding the model to understand task context.[22] In practice, visual few-shot prompting (providing example images with corresponding annotations to define a task) performs well in object recognition, chart comprehension, and document parsing. However, visual prompt design faces unique challenges: images have far higher information density than text, making it harder to predict how the model allocates "attention" across different regions; image resolution, cropping, and color characteristics can all affect model comprehension; and current models still exhibit notable deficiencies in understanding spatial relationships (above, below, left, right) and counting tasks (how many objects are in the image).
Cross-modal Prompt Strategies. A more cutting-edge research direction explores the complementary and reinforcing effects between different modalities. For instance, when analyzing a technical document, providing both the scanned image and the OCR-extracted text — the image supplies layout and chart information, while the text provides precise linguistic content. In audio comprehension tasks, supplying both the audio clip and its text transcription enables the model to combine acoustic features such as tone and pacing with semantic content for more accurate interpretation.[23] The design principle of multimodal prompt strategies is to leverage the complementarity of different modalities to reduce the model's interpretive uncertainty — when information from one modality is insufficient, another modality can provide supplementation and verification.
Structured Output and Tool-Use Prompt Design. An increasingly important dimension of multimodal prompt engineering is guiding the model to generate structured outputs (such as JSON, XML, or Markdown tables) and to invoke external tools. This requires prompt designers to understand not only natural language communication principles but also data structure and API design patterns. In real-world enterprise applications, model outputs typically need to be parsed and processed by downstream systems — a malformed JSON output can cause an entire automated workflow to fail. Consequently, prompts must include precise output format definitions, boundary condition handling rules, and error-handling fallback strategies.[24]
VII. From Prompt Engineering to AI Agent Architecture Design
The evolutionary trajectory of Prompt Engineering clearly points toward a grander direction: AI Agent architecture design. In fact, the most advanced AI Agent systems today — whether AutoGPT, OpenClaw, or enterprise-grade Salesforce Agentforce — can all be understood at their core as the coordinated operation of carefully designed prompt modules.
Prompts as the Agent's "Cognitive Architecture." In a typical AI Agent system, prompts play at least four core roles: the system prompt defines the Agent's identity, goals, and behavioral constraints (analogous to human "values" and "professional norms"); the task prompt defines the specific task to be completed and its success criteria (analogous to "work instructions"); the reasoning prompt defines how the Agent thinks and plans (analogous to "methodology" and "problem-solving strategies"); and the tool-use prompt defines which tools the Agent can invoke and when (analogous to a "skill set" and "tool usage manual").[25] The design quality of these prompt modules directly determines the Agent's capability boundaries and behavioral reliability.
Prompt Coordination in Multi-Agent Systems. In more complex multi-agent systems, Prompt Engineering must also address the "communication protocol" between Agents — how different role-playing Agents exchange information, coordinate actions, and resolve conflicts. Park et al. (2023) demonstrated this in their "Generative Agents" research, where 25 LLM-driven virtual characters autonomously interacted, formed social relationships, and lived in a simulated town — each character's behavior was entirely driven by "memory," "reflection," and "planning" mechanisms defined in their prompts.[26] This research hints at a profound possibility: Prompt Engineering is not only about designing communication between humans and machines but may evolve into designing communication between machines — a new form of "AI social engineering."
Security Implications of Agent Architecture. When prompts become the AI Agent's "cognitive architecture," prompt security becomes equivalent to Agent behavioral safety. An AI Agent with a successfully injected malicious prompt could autonomously execute unauthorized operations — sending phishing emails, modifying critical system configurations, or exfiltrating sensitive data. This means the security dimension of Prompt Engineering is no longer merely about "preventing the model from generating inappropriate content" but about "preventing autonomous systems from executing unauthorized actions" — a fundamental security escalation. Anthropic's Constitutional AI approach and OpenAI's Instruction Hierarchy framework are both important attempts to establish Agent behavioral safety at the prompt level.[27]
VIII. Conclusion: Prompt Engineering as a New Paradigm for Human-Machine Interfaces
Reviewing the academic foundations, core methodologies, enterprise applications, security challenges, and future directions covered in this article, we can articulate a central thesis: Prompt Engineering is evolving from a provisional "usage technique" into an independent discipline with theoretical foundations, a methodological framework, and engineering practice standards.
This discipline's academic foundations are rooted in pragmatic theory from linguistics, cognitive load theory from cognitive science, and the pre-training paradigm from computational linguistics. Its core methodologies span a progressive reasoning framework from zero-shot to Tree-of-Thoughts. Its engineering practice has advanced from individual manual tuning to enterprise-grade template engineering and automated optimization. Its security dimension has expanded from preventing malicious outputs to safeguarding the behavioral safety of autonomous systems.
For enterprise decision-makers, three strategic implications stand out. First, invest in building prompt engineering capabilities. In the ROI equation of AI investment, model capability is only half the story — the other half is the ability to effectively guide the model. A well-designed prompt strategy can extract multiples of value from the very same model. Second, establish systematic prompt governance mechanisms. As AI adoption spreads, prompt management will become a critical organizational capability — just as important as code quality management and data governance. Third, pay close attention to the evolution from prompt engineering to agent engineering. Prompt Engineering is the foundation of AI Agent architecture design — understanding how to effectively "instruct" an LLM is the prerequisite for understanding how to effectively "design" an autonomous AI system.
Ultimately, the significance of Prompt Engineering transcends the technical level — it represents humanity's first attempt to "program" a machine using its most natural mode of communication: natural language. This means the barrier to "using AI" has been lowered to the level of human language ability — the most fundamental democratization in the history of computing. Yet precisely because of this, ensuring the quality, safety, and fairness of this new form of human-machine communication has become a defining question of our era. From intuition to science, from craft to methodology, the Prompt Engineering revolution is only just beginning.
References
- Liu, P. et al. (2023). Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys, 55(9), 1–35. doi.org
- Grice, H. P. (1975). Logic and Conversation. In Syntax and Semantics 3: Speech Acts, pp. 41–58. Academic Press.
- Sweller, J. (1988). Cognitive Load During Problem Solving: Effects on Learning. Cognitive Science, 12(2), 257–285. doi.org
- Liu, P. et al. (2023). Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys, 55(9), 1–35. doi.org
- Brown, T. et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems (NeurIPS), 33, 1877–1901. arxiv.org
- Zhao, Z. et al. (2021). Calibrate Before Use: Improving Few-Shot Performance of Language Models. Proceedings of ICML 2021. arxiv.org
- Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems (NeurIPS), 35. arxiv.org
- Kojima, T. et al. (2022). Large Language Models are Zero-Shot Reasoners. Advances in Neural Information Processing Systems (NeurIPS), 35. arxiv.org
- Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in Neural Information Processing Systems (NeurIPS), 36. arxiv.org
- Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. Proceedings of ICLR 2023. arxiv.org
- Wang, X. et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. Proceedings of ICLR 2023. arxiv.org
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems (NeurIPS), 33. arxiv.org
- Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint. arxiv.org
- White, J. et al. (2023). A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv preprint. arxiv.org
- Chase, H. (2022). LangChain: Building Applications with LLMs through Composability. github.com
- Perez, F. & Ribeiro, I. (2022). Ignore This Title and HackAPrompt: Exposing Systemic Weaknesses of LLMs through a Global Scale Prompt Hacking Competition. Proceedings of EMNLP 2023. arxiv.org
- Greshake, K. et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. Proceedings of AISec 2023. arxiv.org
- Yi, J. et al. (2023). Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. arXiv preprint. arxiv.org
- OWASP. (2025). OWASP Top 10 for Large Language Model Applications. owasp.org
- Khattab, O. et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. Proceedings of ICLR 2024. arxiv.org
- Zhou, Y. et al. (2023). Large Language Models Are Human-Level Prompt Engineers. Proceedings of ICLR 2023. arxiv.org
- Yang, Z. et al. (2023). The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). arXiv preprint. arxiv.org
- Wu, S. et al. (2023). Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv preprint. arxiv.org
- Shanahan, M. (2024). Talking About Large Language Models. Communications of the ACM, 67(2), 68–79. doi.org
- Wang, L. et al. (2024). A Survey on Large Language Model Based Autonomous Agents. Frontiers of Computer Science, 18(6). arxiv.org
- Park, J. S. et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. Proceedings of UIST 2023. arxiv.org
- Wallace, E. et al. (2024). The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv preprint. arxiv.org