DeepEval and Confident AI

Pricing

Free plan: sufficient for basic use

Starter plan: $20/month

Quick start

See the official documentation.

Flags for `deepeval test run`

Parallelization

deepeval test run test_example.py -n 4

Caching

deepeval test run test_example.py -c

Repeats

deepeval test run test_example.py -r 2

Hooks

import deepeval

...

@deepeval.on_test_run_end
def function_to_be_called_after_test_run():
    print("Test finished!")

Basic concepts

See the official documentation.


Test Case	Contains input / actual_output / retrieval_context
Dataset	A collection of test cases
Golden	Like a test case, but without `actual_output`
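
The relationship between a golden and a test case can be sketched in plain Python. The block below does not use DeepEval's own classes: `Golden`, `SimpleTestCase`, `to_test_case`, and `fake_llm_app` are hypothetical stand-ins defined here only to illustrate that a golden becomes a test case once the application's `actual_output` is filled in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Golden:
    """Hypothetical stand-in: an input plus optional reference data, no actual_output."""
    input: str
    expected_output: Optional[str] = None


@dataclass
class SimpleTestCase:
    """Hypothetical stand-in: a golden plus the output the application produced."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None


def to_test_case(golden: Golden, app: Callable[[str], str]) -> SimpleTestCase:
    # Run the application on the golden's input to obtain actual_output.
    return SimpleTestCase(
        input=golden.input,
        actual_output=app(golden.input),
        expected_output=golden.expected_output,
    )


def fake_llm_app(prompt: str) -> str:
    # Placeholder for a real LLM call; it only echoes the prompt.
    return f"Echo: {prompt}"


goldens = [Golden(input="What are your operating hours?")]
dataset: List[SimpleTestCase] = [to_test_case(g, fake_llm_app) for g in goldens]
print(dataset[0].actual_output)
```

In a real project the echo function would be replaced by a call into the application under test, and the resulting objects would be DeepEval's own test case type.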

Example test case

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Who is the current president of the United States of America?",
    actual_output="Joe Biden",
    retrieval_context=["Joe Biden serves as the current president of America."],
)

Evaluating with Pytest

import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset(test_cases=[...])

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    answer_relevancy_metric = AnswerRelevancyMetric()
    assert_test(test_case, [answer_relevancy_metric])

@pytest.mark.parametrize is a decorator provided by Pytest. Here it simply iterates over the EvaluationDataset so that each test case is evaluated individually.

Running without the CLI

See the official documentation.

# A hypothetical LLM application example
import chatbot
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# ……
test_cases = [first_test_case, second_test_case]

metric = HallucinationMetric(threshold=0.7)
evaluate(test_cases, [metric])
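
Conceptually, an evaluation run scores every test case with every metric and compares each score against the metric's threshold. The sketch below is a plain-Python illustration of that idea, not DeepEval's implementation: `WordOverlapMetric`, `run_evaluation`, and the dictionary-based test cases are hypothetical, and the word-overlap score merely stands in for the LLM-based scoring a real metric would perform. It also assumes that higher scores are better, which does not hold for every metric.

```python
from typing import Dict, List


class WordOverlapMetric:
    """Hypothetical toy metric: fraction of output words found in the context."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold

    def measure(self, test_case: Dict) -> float:
        output_words = test_case["actual_output"].lower().split()
        context_words = set(" ".join(test_case["context"]).lower().split())
        if not output_words:
            return 0.0
        hits = sum(1 for word in output_words if word in context_words)
        return hits / len(output_words)


def run_evaluation(test_cases: List[Dict], metrics: List[WordOverlapMetric]) -> List[bool]:
    # A test case passes only if every metric meets its threshold.
    results = []
    for test_case in test_cases:
        passed = all(m.measure(test_case) >= m.threshold for m in metrics)
        results.append(passed)
    return results


toy_cases = [
    {"actual_output": "closed on weekends", "context": ["we are closed on weekends"]},
    {"actual_output": "open every day", "context": ["we are closed on weekends"]},
]
results = run_evaluation(toy_cases, [WordOverlapMetric(threshold=0.7)])
print(results)
```

The first toy case passes because all of its output words appear in the context; the second fails because none do.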

Test cases

Standard test case

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",  # required
    actual_output="We offer a 30-day full refund at no extra cost.",  # required
    expected_output="You're eligible for a 30 day refund at no extra cost.",  # optional
    context=["All customers are eligible for a 30 day full refund at no extra cost."],  # ideal (reference) context
    retrieval_context=["Only shoes can be refunded."],  # what the retriever actually returned
    latency=10.0,
)

`context` is the ideal retrieval result for a given input, typically taken from the evaluation dataset.

`retrieval_context` is the retrieval result the LLM application actually produced.

Datasets

Creating a dataset manually and pushing it to Confident AI

from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

# Raw data
original_dataset = [
    {
        "input": "What are your operating hours?",
        "actual_output": "...",
        "context": [
            "Our company operates from 10 AM to 6 PM, Monday to Friday.",
            "We are closed on weekends and public holidays.",
            "Our customer service is available 24/7.",
        ],
    },
    {
        "input": "Do you offer free shipping?",
        "actual_output": "...",
        "expected_output": "Yes, we offer free shipping on orders over $50.",
    },
    {
        "input": "What is your return policy?",
        "actual_output": "...",
    },
]

# Iterate over the raw data and build LLMTestCase instances
test_cases = []
for datapoint in original_dataset:
    # Missing keys default to None; no local variable shadows the built-in input()
    test_case = LLMTestCase(
        input=datapoint.get("input"),
        actual_output=datapoint.get("actual_output"),
        expected_output=datapoint.get("expected_output"),
        context=datapoint.get("context"),
    )
    test_cases.append(test_case)

# Wrap the list of LLMTestCase objects in an EvaluationDataset
dataset = EvaluationDataset(test_cases=test_cases)

# Push to Confident AI
dataset.push(alias="My Confident Dataset")


Viewing the results

Creating a dataset manually in the Confident AI Web UI

Chinese language support

Manually edit every prompt in /Users/xxx/anaconda3/envs/LI311-b/lib/python3.11/site-packages/deepeval/synthesizer/template.py, adding the following rule:

6. `Rewritten Input` should be in Chinese.

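Metrics

Evaluation dimensions

Metric	Description
Correctness and semantic similarity	How the generated answer compares with the reference answer
Context Relevancy	Relevance of the retrieved context to the query
Faithfulness	Consistency of the generated answer with the retrieved context
Answer Relevancy	Relevance of the generated answer to the query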
