Part of the 4TU.Centre for Engineering Education


Project introduction and background information

The rapid emergence of Generative AI (GenAI) chatbots (tools like ChatGPT capable of producing human-like text on demand) has disrupted higher education. For students, GenAI introduces new possibilities for learning and writing, but also new responsibilities: knowing when and how to use AI meaningfully, evaluating its outputs, and ensuring that submitted work reflects their own learning. For educators and institutions, the challenge is more direct: when a student can delegate writing to an AI, traditional assessment products (essays, reports, answers) can no longer be treated straightforwardly as evidence of learning. The validity of assessment is directly affected.

This project, funded jointly by 4TU.CEE (Centre for Engineering Education) and TU/e's BOOST! programme, set out to explore assessment methods that address these challenges. Rather than treating GenAI as a threat to be detected and banned, the project took a constructive stance: if students are going to use these tools (and they do), how can educators design assessments that remain valid and capture genuine learning even in AI-integrated classrooms? The project focused on process-oriented assessment approaches, shifting attention from the final written product to the dialogue between student and AI: what students ask for, how they steer the interaction, and what domain knowledge they bring to it.

Across four interconnected work packages, the project reviewed the literature on AI-compatible assessment, developed and validated a novel assessment framework (DRIVE) and accompanying taxonomy, piloted RAG-based AI tutoring tools in a classroom setting, and engaged with educators and students through workshops, webinars, and outreach activities within and beyond TU/e. Together, the work packages produced a set of empirically grounded, practically oriented resources for educators working in AI-integrated teaching contexts.

Objective and expected outcomes

The project's main objectives and expected outcomes are organised into the following work packages:

WP1: Mapping the literature on future-oriented and GenAI-compatible higher education

WP1 mapped the international research landscape on GenAI in higher education, with two specific questions: What future-oriented learning objectives are being identified as relevant in an AI-integrated world? And how are educators transforming their assessment practices in response? The review identified a general consensus around shifting from product-focused to process-oriented assessment, reorienting learning objectives toward higher-order skills (critical thinking, ethical reasoning, adaptability, AI literacy), and investing in faculty development. It also established the theoretical and empirical basis for the design choices made in subsequent work packages.

Outcomes:

  • Report 1a (see downloads on this page)

WP2: Designing a framework for learning assessment through AI interaction analysis

WP2 developed and validated the DRIVE framework and an accompanying 35-item taxonomy for evaluating student learning through interactions with GenAI chatbots. The framework distinguishes two complementary indicators of learning visible in interaction logs: Directive Reasoning Interaction (DRI), which captures how actively and purposefully the student steers the AI, and Visible Expertise (VE), which refers to the extent to which the student makes acquired course knowledge visible in their prompts. The taxonomy categorises interactions across three domains (Writing, Content, Argument) and was developed iteratively from actual student interaction logs in STEM writing-intensive courses. Empirical validation showed a positive correlation between interaction quality scores and traditional essay grades (Pearson r = 0.54, p < .001, N = 70), supporting the feasibility of process-based assessment.
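As an illustration of the validation step, the correlation between interaction-quality scores and essay grades is a standard Pearson r. The sketch below computes it in pure Python on invented scores (the study itself analysed 70 graded essays):

```python
# Illustrative sketch: correlating per-student interaction-quality scores
# with essay grades, as in the DRIVE validation study (Pearson r).
# All data below are made up for demonstration purposes.

import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical interaction-quality scores (e.g., combined DRI + VE) and grades
quality = [3.1, 4.5, 2.0, 4.9, 3.8, 2.5]
grades  = [6.5, 8.0, 5.5, 9.0, 7.0, 6.8]

print(f"Pearson r = {pearson_r(quality, grades):.2f}")
```

A value near the study's r = 0.54 would indicate that higher-quality interactions tend to accompany better essays, without implying causation.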

Outcomes:

  • Journal publication: Oliveira, M., Zednik, C., Bombaerts, G., Sadowski, B., & Conijn, R. (2025). Assessing students’ DRIVE: A framework to evaluate learning through interactions with generative AI. Computers and Education: Artificial Intelligence, 100497.  https://doi.org/10.1016/j.caeai.2025.100497
  • Report 2a presents the framework and empirical findings;
  • Report 2b is a practical user guide written directly for teachers who wish to implement the taxonomy in their own courses.


WP3: Pedagogical tools — RAG-based AI tutors in the classroom

WP3 piloted the use of RAG (Retrieval-Augmented Generation) chatbots as voluntary learning tools in a Bachelor-level Cognitive Psychology course (N = 116 students sitting the final exam). Students were offered access to two course-specific chatbots during the three weeks before their exam. Results showed that when RAG tools are offered without structured pedagogical integration, students tend to use them superficially, primarily for last-minute topic clarification, and no significant improvement in exam performance was observed. The report concludes that providing access to AI tools alone does not produce learning gains; structured activities that encourage deeper, agentic engagement are more likely to do so.

Outcomes:

  • Report 3 (see downloads on this page)

WP4: Workshops on GenAI tools for educational activities

WP4 delivered workshops and learning activities for both students and staff at TU/e, covering AI literacy, effective prompt engineering, engagement with AI outputs, and the use and tailoring of chatbots for learning. These activities provided hands-on experience with GenAI tools in educational contexts and served as a testing ground for the frameworks and insights developed in WP1–3. WP4 activities also contributed to the outreach activities described below.

Outcomes:

  • Multiple workshops; see also the webinar in the downloads on this page

Results and learnings

Scientific contributions

Assessing students’ DRIVE: A framework to evaluate learning through interactions with generative AI

Reference: Oliveira, M., Zednik, C., Bombaerts, G., Sadowski, B., & Conijn, R. (2025). Assessing students’ DRIVE: A framework to evaluate learning through interactions with generative AI. Computers and Education: Artificial Intelligence, 100497.  https://doi.org/10.1016/j.caeai.2025.100497

This peer-reviewed journal article introduces the DRIVE framework and its supporting taxonomy, and presents the empirical validation study conducted in STEM ethics courses at TU/e. The study analysed 1,450 annotated GenAI interactions from 70 graded essays across three course groups (2023–2025). The main finding is that the quality of a student's interaction with an AI chatbot is a valid and reliable indicator of their domain-specific learning: students who demonstrated higher Visible Expertise and Directive Reasoning in their prompts also tended to produce higher-quality essays, and vice versa.

The taxonomy revealed four distinct GenAI usage profiles associated with different levels of mastery. High-performing students tended to engage in either a collaborative intellectual partnership (when assessed on interaction quality), bringing original ideas, critiquing AI outputs, and steering conceptual development, or a targeted improvement partnership (when assessed by essay grade), systematically refining text and integrating feedback. Low-performing students were more likely to exhibit basic information retrieval (seeking definitions and examples with little direction) or passive task delegation (copy-pasting assignment instructions to the AI). Simply using GenAI did not improve essay performance: there was no significant difference in essay scores between AI users and non-users. What mattered was how students used it.

Student profiles and perspectives on being assessed on the use of Generative AI for graded coursework

[Working paper available in the downloads on this page. Preliminary results; contact the corresponding author before citing.] Oliveira, M., Zednik, C., Sadowski, B., Bombaerts, G., & Conijn, R. (in preparation). Student profiles and perspectives on being assessed on the use of Generative AI for graded coursework.

This working paper addresses a question left open by the DRIVE study: what do students experience when they are formally assessed on their GenAI interactions? The paper examines AI adoption decisions and learning experiences among graduate students (N = 45) completing individually graded argumentative ethics essays in a context where AI use was explicitly permitted, with interaction logs submitted and graded alongside the essay.

Using a mixed-methods design combining psychometric surveys (writing self-efficacy, need for cognition, AI literacy), verified adoption records, prompt-log annotations, and open-ended reflective responses, the study finds that AI literacy (not writing confidence or cognitive style) is the individual-level factor most strongly associated with whether students chose to use GenAI. Qualitative analysis revealed five distinct adoption rationales: strategic efficiency (AI as scaffolding), skill compensation and augmentation, performative compliance (adopting because it was graded and encouraged), intellectual agency (principled non-adoption to preserve authorship or manage risk), and ambivalence. A parallel thematic analysis of learning-experience reflections identified benefits (source discovery, cognitive scaffolding, efficiency) alongside concerns about AI reliability and hallucination, as well as a minority disposition of actively maintaining intellectual ownership, in which students distinguished AI's role as an execution assistant from its role as an idea generator.

The paper offers an initial empirical account of a largely unexamined context: what it is like, as a student, to have your use of GenAI formally assessed.

Outreach

The DRIVE framework and taxonomy, introduced in Oliveira et al. (2025), have been presented to educators across a range of contexts within and beyond TU/e.

Internally, the framework was presented in workshops and sessions across multiple departments, reaching colleagues from different disciplinary backgrounds and helping educators consider how process-focused assessment of GenAI use might be adapted to contexts beyond the philosophy and ethics courses in which it was originally developed.

Externally, the framework has been presented at other Dutch universities that are not part of the 4TU network, extending its reach to a broader community of higher education educators and researchers.

Across the 4TU network, the framework was the focus of a dedicated 4TU.CEE webinar held on 18 November 2025, bringing together educators and researchers from the four Dutch technical universities (TU/e, TU Delft, University of Twente, and Wageningen University & Research). The webinar, "What student prompts reveal about their learning: Introducing the DRIVE framework", presented the research findings and invited educators from other disciplines to consider whether and how prompt-grading could work in their own teaching contexts.

  • Webinar slideshow available in the downloads on this page

Recommendations

WP1: Recommendations based on the literature review

Higher education institutions should redesign their curricula and assessment methods to emphasize skills that AI cannot easily replicate, and consider designing novel courses that address the growing need for AI literacy, critical thinking, and the ethics of human-technology interaction. This implies shifting from content-focused instruction to developing higher-order thinking skills through activities that are less easily offloaded to AI systems. What constitutes desirable or undesirable use of AI, however, ultimately depends on the intended learning objectives (ILOs) of a course.

Investing in teachers' AI literacy should make it easier to design courses that coexist harmoniously with the technology. In practice, this means better alignment between ILOs, pedagogical activities, and assessment approaches. One example could be a course in which students learn to responsibly co-write essays with generative AI, followed by assessment of the interaction between student and AI throughout the writing process (i.e., prompt analytics). Conversely, if the ILOs emphasize core competencies that AI systems can already execute well, but which students need in order to scrutinize and assess AI outputs effectively, teachers should consider teaching and evaluating those skills in an AI-free pedagogical environment.

The majority of assessed student output in higher education is verbal: essays, reports, and presentations. This type of output is directly threatened by the capability of large language models (LLMs) and other generative AI tools to easily produce and manipulate verbal content, which requires a rethinking of how teachers assess learning. As AI increasingly takes over otherwise hard-earned thinking and writing skills, assessment strategies should move away from traditional essays and exams toward performance-based evaluation methods that demonstrate authentic learning and application of knowledge. This includes more real-time assessments such as presentations, group projects, and case studies, which require students to demonstrate critical thinking, problem-solving, and creativity while applying their knowledge in a given context.


WP2. How to assess learning through the analysis of student-GenAI interactions

To assess learning in contexts where GenAI is permitted, instructors must shift their focus from the final product to the learning process itself. This work package proposes the DRIVE framework as a method to evaluate this process by analyzing student-GenAI interaction logs. The framework assesses two core components: Directive Reasoning Interaction (DRI), which measures how students critically steer the AI, and Visible Expertise (VE), which identifies how students articulate their acquired domain knowledge within the dialogue. While this project utilized a specific interaction taxonomy tailored to argumentative writing to operationalize these concepts, the DRIVE framework itself is designed to be adaptable across different domains and assessment types.

Empirical validation of this approach revealed a positive correlation (r = 0.54) between the quality of students' GenAI interactions and their final academic outcomes. High performance was associated with a "collaborative intellectual partnership" profile, characterized by students posing original ideas, refining concepts, and critically evaluating AI outputs. In contrast, lower outcomes were correlated with "passive task delegation" or basic information retrieval, where students relied on the AI to generate content without significant steering or knowledge infusion. Based on these findings, we offer the following recommendations for practice:

  • Explicitly define the learning evidence: Clearly articulate whether the assessment focuses on technical AI literacy (e.g., prompting skills) or domain-specific learning (e.g., Visible Expertise in the prompt).
  • Require and evaluate interaction logs: Make the submission of complete interaction logs a requirement in your assessment guidelines. This allows instructors to gain visibility into the student's Directive Reasoning Interaction (DRI) and their agency in the co-creation process.
  • Design tasks promoting partnership (where the incentive fits): Develop assignments that require students to use GenAI as a "thinking partner" for conceptual refinement and critique, rather than for simple production or information gathering. Reflect first on whether this is an acceptable use of GenAI given your learning objectives and pedagogical context.
  • Distinguish interaction profiles: Teach students to move beyond "passive task delegation" behaviors and model "collaborative intellectual partnership" strategies to support deeper learning. Again, reflect on whether this is an acceptable use of GenAI given your learning objectives and pedagogical context.
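To make the log-evaluation idea concrete, the sketch below scores a student–GenAI interaction log against a toy rubric. The category labels and the simple counting scheme are illustrative placeholders, not the actual 35-item DRIVE taxonomy:

```python
# Minimal sketch of scoring an interaction log against a process-focused
# rubric. Category names and weights are invented for illustration only.

from dataclasses import dataclass

@dataclass
class Turn:
    prompt: str
    category: str  # assigned by a human rater (or a rater-assisting classifier)

# Hypothetical mapping: directive/critical moves count toward DRI,
# knowledge-bearing moves toward VE; passive retrieval scores zero.
DRI_CATEGORIES = {"critique_output", "steer_argument", "refine_concept"}
VE_CATEGORIES = {"apply_course_theory", "cite_course_material"}

def score_log(turns):
    """Count DRI- and VE-relevant moves in an annotated interaction log."""
    dri = sum(1 for t in turns if t.category in DRI_CATEGORIES)
    ve = sum(1 for t in turns if t.category in VE_CATEGORIES)
    return {"DRI": dri, "VE": ve, "total_turns": len(turns)}

log = [
    Turn("Summarise utilitarianism for me.", "information_retrieval"),
    Turn("That ignores rule utilitarianism; revise using Mill.", "critique_output"),
    Turn("Apply the harm principle from week 3 to my case.", "apply_course_theory"),
]
print(score_log(log))  # {'DRI': 1, 'VE': 1, 'total_turns': 3}
```

In practice the scores would feed into a rubric aligned with the course's learning objectives, rather than being simple counts.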

WP3: Pilot studies using GenAI based tutoring applications in the classroom

The preliminary findings from this pilot study already offer several considerations for educators. First, the results suggest that simply providing access to a course-specific RAG chatbot, even one perceived positively by students, is no guarantee of improved learning outcomes. Instructors should not assume that students will spontaneously use these tools in pedagogically optimal ways: our current data suggest that, by default, students tend to engage with these tools at a superficial level (e.g., last-minute clarification). To foster the deeper, agentic engagement associated with positive learning (Smirnova, 2025; Yang et al., 2024), instructors should design specific, structured activities. For example, rather than leaving use entirely open, an educator could require students to use the chatbot to generate practice questions early in a module, to find flaws in an argument, or to critique a chatbot-generated summary of a complex topic. This approach shifts the student's role from passive consumer to active critical evaluator. Finally, educators should remain mindful of the technological "lag" discussed in the limitations: if institutional tools are perceived as less capable than rapidly evolving commercial alternatives, students may simply ignore them.
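For readers unfamiliar with Retrieval-Augmented Generation, its retrieval step can be sketched as follows. The course snippets, the overlap-based ranking, and the prompt template are all simplified stand-ins; a production system would use embedding-based retrieval and an actual LLM call:

```python
# Toy sketch of the retrieval step in a RAG course chatbot: rank course
# material chunks by token overlap with the student's question, then build
# a grounded prompt. The "course material" below is invented for illustration.

COURSE_CHUNKS = [
    "Working memory holds a limited number of items for short periods.",
    "Classical conditioning pairs a neutral stimulus with a reflex response.",
    "Retrieval practice strengthens long-term retention more than rereading.",
]

def tokenize(text):
    return {w.strip(".,?!") for w in text.lower().split()}

def retrieve(question, chunks, k=2):
    """Return the k chunks sharing the most tokens with the question."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, chunks):
    """Assemble a prompt that grounds the (hypothetical) LLM in course text."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks))
    return (f"Answer using ONLY the course material below.\n"
            f"{context}\nQuestion: {question}")

print(build_prompt("How does retrieval practice affect retention?", COURSE_CHUNKS))
```

Grounding answers in retrieved course material is what distinguishes a course-specific RAG tutor from a general-purpose chatbot; as the pilot shows, however, the pedagogy around the tool matters more than the tool itself.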

References

  • Smirnova, L. (2025). Developing students’ agency and voice by using generative AI in an online EAP module. Innovation in Language Learning and Teaching, 1–11. https://doi.org/10.1080/17501229.2025.2538781
  • Yang, Y., Luo, J., Yang, M., Yang, R., & Chen, J. (2024). From surface to deep learning approaches with Generative AI in higher education: An analytical framework of student agency. Studies in Higher Education, 49(5), 817–830. https://doi.org/10.1080/03075079.2024.2327003

Practical outcomes

The following recommendations are drawn from across the project's work packages, published article, and working paper. They are addressed to educators who teach in, or are considering, AI-integrated courses where students may use GenAI chatbots for graded work.

On assessment design

  • Consider combining essay assessment with interaction log evaluation. Assessing the final essay alone provides an incomplete picture in AI-integrated classrooms. Requiring students to submit their interaction logs alongside their written work adds a complementary view of their learning process that the essay alone does not offer. The two assessment methods capture overlapping but distinct aspects of student engagement.
  • Process-focused assessment rewards different behaviours than product-focused assessment. Traditional essay grading tends to reward systematic text refinement and conceptual integration; interaction log assessment tends to surface original idea development and critical engagement that may not be fully visible in the polished final text. Both capture genuine learning, but the assessment method shapes which behaviours students prioritise.
  • Make sure that non-adoption carries no inherent grade penalty. Not all students will or should use GenAI. Principled non-adoption (to preserve authorship, develop independent writing skills, or manage grade risk) is a legitimate pedagogical choice. Assessment designs should make this choice genuinely voluntary rather than inadvertently penalised through heavy weighting of the AI-use documentation component.
  • Always align rubrics assessing the process of using GenAI with your course learning objectives. What counts as high-quality interaction depends on the discipline and the intended learning outcomes. The DRIVE framework is flexible by design: the educator's learning objectives define what Visible Expertise (how the student's knowledge becomes visible in the workflow) and Directive Reasoning Interaction (how the student's intent and evaluative behaviors show throughout the logged interactions) look like in a given course, insofar as these learning objectives can be documented in some symbolic format during an interaction between student and AI system. Rubrics should be developed thoughtfully and communicated clearly to students before the task begins, and revised every course iteration to adapt to changes in the technological landscape.

On student preparation

  • Treat AI literacy as a design consideration, not a given. AI literacy (knowing how to use GenAI tools effectively and appropriately) is unevenly distributed among students. Before assigning graded AI-integrated tasks, embed brief, task-specific preparation: not generic tool demonstrations, but guided practice with the type of AI-assisted process the task actually requires.
  • Provide concrete examples of what good GenAI use looks like in your course. Students who are uncertain about what appropriate use entails may avoid the tools entirely, adopt superficially to satisfy perceived expectations, or outsource intellectual work they should be doing themselves. Worked examples and scaffolded demonstrations reduce this ambiguity.
  • Consider exercises that expose students to AI failure modes. Awareness of hallucination and source fabrication tends to develop through first-hand experience rather than abstract warnings. Activities that require students to verify AI-generated references or compare AI summaries against original sources can build verification habits more durably than general cautionary advice.

On pedagogical integration of AI tools

  • Providing access to a GenAI or RAG tool alone does not produce learning gains. When AI tools are offered as optional, unscaffolded resources, students tend to use them superficially, primarily for last-minute clarification, and no significant improvement in learning outcomes has been observed. Structured activities that encourage deeper, agentic engagement (e.g., generating practice questions early in the course, critiquing AI-generated arguments, using AI to develop and test one's own ideas) are more likely to produce learning benefits.
  • Design activities that reward collaborative intellectual partnership rather than passive task delegation. High-performing students in the project's studies used GenAI as a thinking partner: bringing their own ideas, pushing back on AI outputs, and steering the conversation toward conceptual development. Lower-performing students tended to copy-paste instructions or ask for definitions. Assessment design and learning activities should encourage the former.
  • Focus on developing transferable AI literacy rather than tool-specific fluency. Institutional AI tools evolve and are frequently replaced. Pedagogy focused on critical thinking, effective interaction strategies, and output evaluation will serve students across tools and over time; pedagogy focused narrowly on a specific platform will not.

On understanding students' perspectives

  • Requiring graded submission of interaction logs may reduce voluntary AI adoption. The project found that requiring documentation and grading of AI interactions was associated with a decrease in students' willingness to use these tools at all. Before implementing process-focused assessment, consider the context and how the policy is communicated.
  • Different non-adopters may need different support. Students who actively chose not to use AI on principled grounds (authorship, intellectual agency) and students who were ambivalent but uncertain what good use looked like are meaningfully different groups. The former benefit from assurance that non-adoption is genuinely supported; the latter benefit from clearer scaffolding and worked examples.
  • Making the distinction between AI as execution tool and AI as idea generator explicit can help. Some students spontaneously maintained clear intellectual ownership while using AI; many did not. Task framing that asks students to distinguish which steps were AI-assisted and which were independently reasoned, or that requires annotation of their interaction logs, can scaffold this distinction and generate visible evidence of genuine intellectual engagement.

On scaling and automation

  • Automated classification of interaction logs is promising but still requires human oversight. The project's experiments with GPT-4o for automated taxonomy classification showed substantial AI self-consistency (Fleiss' kappa = 0.78) but only fair human-AI agreement (kappa = 0.3–0.4). Automated tools can support the assessment process, particularly for preliminary sorting and pattern identification, but human review remains important, especially for assessing sophisticated intellectual collaboration. Teachers (in philosophical writing-intensive courses) report an average of approximately 15 minutes per interaction log for manual assessment.
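As an illustration of the agreement statistics mentioned above, the sketch below computes Cohen's kappa between a human rater and an automated classifier on the same set of taxonomy labels. The labels and data are invented for demonstration:

```python
# Sketch of human-AI agreement on taxonomy labels via Cohen's kappa,
# the kind of chance-corrected statistic behind the figures reported above.
# Label names and sequences are made up for illustration.

from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two raters' label sequences of equal length."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement expected from each rater's label distribution
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

human = ["critique", "retrieve", "retrieve", "steer", "critique", "retrieve"]
ai    = ["critique", "retrieve", "steer",    "steer", "retrieve", "retrieve"]

print(f"kappa = {cohens_kappa(human, ai):.2f}")
```

Raw percent agreement overstates reliability when a few categories dominate; kappa corrects for the agreement expected by chance, which is why it is the conventional check before trusting automated labels.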