Leveraging Human Expertise for LLM-Assisted Dialogue Character Extraction and Attribution in Classic Chinese Novel

Yutong Yang, Shanghai Jiao Tong University

Frontend Interface

1. Introduction

Classic Chinese Novels - Using Distant Reading to Understand Complex Narratives

· Revealing new insights into character relationships and narrative roles...

Social Network Analysis - Requiring Accurate Data Extracted from the Novels

· Challenges in classic Chinese novels

· Challenges in data extraction

Challenges in Classic Chinese Novels

  • Complex character networks
  • Rich contexts
  • Multiple aliases per character
Figure: character relationship network of Dream of the Red Chamber (红楼梦)

Challenges in Data Extraction

  • Manual extraction of characters from classic Chinese novels requires significant human effort and is highly time-consuming.
  • The richness of context in novels and the complexity of character references present challenges for fully automated extraction methods.

2. Framework Overview

Back-end Processing
  • Data extraction
  • LLM processing
  • Character identification
Front-end Interface
  • Interactive annotation
  • Visualization
  • Manual refinement
Figure: overall framework architecture

3. Back-end Data Processing

3.1 Upload and Segmentation

Text Upload

Original novel text upload

Rule-based Segmentation

Break into dialogue units with context

Figure: text segmentation
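The rule-based segmentation step can be sketched as follows. This is a minimal illustration, not the system's actual rules: it assumes dialogue is enclosed in Chinese curly quotes (“ … ”) and that context is a fixed character window around each quote. The function name `split_dialogue_units` and the `window` parameter are hypothetical; the keys `talk` and `context` mirror the placeholders in the extraction prompt.

```python
import re

# Assumption: every utterance is wrapped in Chinese curly quotes.
QUOTE_RE = re.compile(r"“[^”]*”")


def split_dialogue_units(text, window=50):
    """Return one dict per quoted utterance, with a window of surrounding text."""
    units = []
    for match in QUOTE_RE.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        units.append({"talk": match.group(0), "context": text[start:end]})
    return units


sample = "黛玉笑道：“你来了。”宝玉道：“我来了。”"
units = split_dialogue_units(sample, window=10)
for unit in units:
    print(unit["talk"], "|", unit["context"])
```

A fixed character window is the simplest choice; a sentence-based or paragraph-based window may preserve more of the narration that names the speaker.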

3.2 LLM-Based Extraction

Speaker and Listener Extraction

prompt = f"""\nQ: I will give you a dialogue sentence and a passage of context. Please repeat the dialogue sentence, then based on the context, identify the speaker, the primary listener(s), and the secondary listener(s) in the specified dialogue sentence. Dialogue sentence: {talk}. Context: {context}. Please provide your answer in the format: 'Dialogue sentence: [dialogue], Speaker: [speaker], Primary Listener(s): [primary listener(s)], Secondary Listener(s): [secondary listener(s)]'. Note: 1. Use commas (",") to separate the dialogue sentence, speaker, primary listener(s), and secondary listener(s); 2. If there are multiple speakers or multiple primary/secondary listeners, separate them using "、"; 3. Do not insert any line breaks in your answer; 4. Resolve pronoun references carefully and avoid vague references like "you", "I", "he", etc.; 5. Do not include any explanation or analysis in your answer; treat it like a fill-in-the-blank question. The more concise, the better; 6. If the speaker, primary listener(s), or secondary listener(s) cannot be identified, respond with 'None'. \nA:"""

⇢ Add to Database
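Before a reply can be added to the database, it has to be parsed back into fields. The sketch below assumes the model follows the answer format requested in the prompt exactly (ASCII commas between fields, "、" between names, 'None' for unknown roles). The names `parse_answer` and `split_names` are hypothetical, and real replies may deviate from the format, in which case the function returns `None` so the unit can be routed to manual annotation.

```python
import re

# Anchoring on the field labels lets the dialogue itself contain commas.
ANSWER_RE = re.compile(
    r"Dialogue sentence:\s*(?P<talk>.*?),\s*"
    r"Speaker:\s*(?P<speaker>.*?),\s*"
    r"Primary Listener\(s\):\s*(?P<primary>.*?),\s*"
    r"Secondary Listener\(s\):\s*(?P<secondary>.*)",
    re.DOTALL,
)


def split_names(field):
    """Split a '、'-separated name list; 'None' or empty means no names."""
    field = field.strip().strip(".")
    if not field or field == "None":
        return []
    return [name.strip() for name in field.split("、") if name.strip()]


def parse_answer(answer):
    """Return a dict of roles, or None if the reply does not match the format."""
    match = ANSWER_RE.search(answer)
    if match is None:
        return None
    return {
        "talk": match.group("talk").strip(),
        "speaker": split_names(match.group("speaker")),
        "primary": split_names(match.group("primary")),
        "secondary": split_names(match.group("secondary")),
    }


reply = ("Dialogue sentence: “你来了。”, Speaker: 林黛玉, "
         "Primary Listener(s): 贾宝玉、薛宝钗, Secondary Listener(s): None")
record = parse_answer(reply)
print(record)
```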

3.3 LLM-Assisted Attribution

Resolve Co-references

prompt = f"\nQ: This is a fill-in-the-blank question with an answer of 0 or 1. Please determine: In the book '{chinese_book_name}', are '{entity}' and '{main_entity}' the same character? If yes, return 1; if not, return 0. Do not consider literary implications. No explanation or analysis is needed.\nA:"

⇢ Add to Database
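One way the 0/1 answers could be turned into an alias table is sketched below. It is an assumption about the surrounding logic, not the author's implementation: `ask` stands in for whatever function sends the co-reference prompt to the LLM and returns its raw reply, and `group_aliases` and `parse_binary_answer` are hypothetical names. Here `ask` is replaced by a small lookup table so the example runs offline.

```python
def parse_binary_answer(answer):
    """Map the model's reply to True/False; None if it is neither '0' nor '1'."""
    text = answer.strip()
    if text == "1":
        return True
    if text == "0":
        return False
    return None


def group_aliases(entities, main_entities, ask):
    """Assign each extracted name to the first main name the model accepts.

    ask(entity, main_entity) must return the raw reply to the co-reference prompt.
    Names that match no main entity are returned separately for manual review.
    """
    alias_of, unresolved = {}, []
    for entity in entities:
        for main in main_entities:
            if entity == main or parse_binary_answer(ask(entity, main)) is True:
                alias_of[entity] = main
                break
        else:  # no main entity matched
            unresolved.append(entity)
    return alias_of, unresolved


# Stand-in for the LLM call, for illustration only.
KNOWN_PAIRS = {("宝玉", "贾宝玉"), ("颦儿", "林黛玉")}


def fake_ask(entity, main_entity):
    return "1" if (entity, main_entity) in KNOWN_PAIRS else "0"


alias_of, unresolved = group_aliases(
    ["宝玉", "颦儿", "林黛玉", "路人"], ["贾宝玉", "林黛玉"], fake_ask
)
print(alias_of, unresolved)
```

Treating any reply other than "1" as a non-match keeps malformed answers from silently merging two characters; the unresolved names are natural candidates for the manual disambiguation step in the front end.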

4. Front-end Annotation Interface

4.1 Worktable

Upload & Cut Dialogues

4.2 Interactive Annotation

Refine annotations: click to switch roles

4.3 Visualization Features

Data visualization design
  • Large nodes (green) = main name
  • Small nodes (yellow) = aliases
  • Links between large nodes = relations (weighted by the number of conversations)
  • Chapter-specific networks: a dialogue network is generated from all character interactions up to the current chapter, revealing narrative progression.
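The cumulative, chapter-specific network described above can be sketched as a weighted edge count. The record layout (`chapter`, `speaker`, `listeners`) and the function names are assumptions made for illustration; the alias table maps each alias to its main name so that links are drawn only between main-name nodes.

```python
from collections import Counter


def edge_key(a, b):
    """Undirected edge: order the two names so (a, b) and (b, a) coincide."""
    return tuple(sorted((a, b)))


def cumulative_network(records, alias_of, up_to_chapter):
    """Count conversations between main characters in chapters 1..up_to_chapter."""
    weights = Counter()
    for rec in records:
        if rec["chapter"] > up_to_chapter:
            continue
        speaker = alias_of.get(rec["speaker"], rec["speaker"])
        for listener in rec["listeners"]:
            listener = alias_of.get(listener, listener)
            if listener != speaker:  # skip self-loops created by alias merging
                weights[edge_key(speaker, listener)] += 1
    return weights


records = [
    {"chapter": 1, "speaker": "宝玉", "listeners": ["林黛玉"]},
    {"chapter": 2, "speaker": "林黛玉", "listeners": ["贾宝玉", "薛宝钗"]},
    {"chapter": 3, "speaker": "薛宝钗", "listeners": ["贾宝玉"]},
]
alias_of = {"宝玉": "贾宝玉"}

net2 = cumulative_network(records, alias_of, up_to_chapter=2)
net3 = cumulative_network(records, alias_of, up_to_chapter=3)
print(net2)
print(net3)
```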

Manual Disambiguation

Correct extraction errors

5. Simple Insights from the Data Visualization

Key Characters

Important intermediary characters identified

Interaction Patterns

Social dynamics revealed

Case

Network of the first chapter

Network of the second chapter

Network of the third chapter

Network of the fourth chapter

6. Conclusion

6.1 Summary

AI + Human

6.2 TODOs

Improving Algorithm Efficiency

Speed up data processing so that results are generated faster.

Incorporating SNA Algorithm

Apply social network analysis (SNA) algorithms to build a final network that can directly support literary analysis.
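The slides do not say which SNA algorithm is planned. As one possible example, betweenness centrality is a common measure for spotting intermediary characters, and the sketch below implements Brandes' algorithm for an unweighted, undirected graph using only the standard library. The adjacency-list input and the function name `betweenness` are assumptions; conversation counts are ignored here.

```python
from collections import deque


def betweenness(adjacency):
    """Betweenness centrality (Brandes) for an unweighted, undirected graph."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search, counting shortest paths from the source.
        stack = []
        preds = {node: [] for node in adjacency}
        sigma = {node: 0 for node in adjacency}
        dist = {node: -1 for node in adjacency}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance.
        delta = {node: 0.0 for node in adjacency}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {node: value / 2 for node, value in score.items()}


# Toy chain: B is the only link between A and C.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
centrality = betweenness(graph)
print(centrality)
```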

Thank You!

flora20@sjtu.edu.cn

https://yutong-yang.github.io/