Assessing the reliability of large language models for deductive qualitative coding: A comparative intervention study with ChatGPT
Date:
In this talk at the ASIS&T Annual Meeting, I presented my paper “Assessing the reliability of large language models for deductive qualitative coding: A comparative intervention study with ChatGPT”. The paper assesses whether ChatGPT can meet standard benchmarks in structured deductive classification tasks. The study uses the Comparative Agendas Project Supreme Court classes as a human-coded benchmark and finds that, with a step-by-step task-decomposition strategy approximating chain-of-thought prompting, ChatGPT reaches acceptable levels of interrater reliability and accuracy approximating that of a custom-trained classifier such as RoBERTa. My talk focused on the implications of using chatbots as approximate human coders or assistants.
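
For readers unfamiliar with interrater reliability, the sketch below shows one common way to compare a model's labels against human-coded labels. It assumes Cohen's kappa as the agreement statistic, which this summary does not confirm as the paper's metric; the function name and the example labels are hypothetical and are not drawn from the study's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items.

    Illustrative only: the statistic actually used in the paper is
    not stated in this summary, so kappa is an assumption here.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both coders must label the same non-empty set of items.")

    n = len(labels_a)

    # Observed agreement: share of items where both coders chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that the coders agree if each assigned
    # labels independently according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both coders used one identical label throughout; treat as full agreement.
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical labels, invented for illustration (not study data).
human_codes = ["civil_rights", "civil_rights", "labor", "labor"]
model_codes = ["civil_rights", "civil_rights", "labor", "civil_rights"]
kappa = cohens_kappa(human_codes, model_codes)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is often reported alongside plain accuracy when a model stands in for a second human coder.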
