在初步探索了 AI 函數的可能性之後,本次我們將目光投向兩個更為核心的函數:AI_AGG 和 EMBED。我們將深入解析這兩個函數的設計理念、實現原理及其在業務場景中的應用,展示 Apache Doris 如何通過原生的函數設計,將文本聚合與語義向量分析無縫集成到 SQL 中,為用户提供更強大、更易用的智能數據分析體驗。
相關閲讀:
Apache Doris 4.0 AI 能力揭秘(二):為企業級應用而生的 AI 函數設計與實踐 Apache Doris 4.0 AI 能力揭秘(一):AI 函數之 LLM 函數介紹 AI_AGG:基於 AI 的文本聚合
聚合是數據分析中最常見的操作,但如果聚合的對象是海量的用户評論、支持工單或日誌文本,傳統的聚合函數難以直接處理這類非結構化文本數據。為此,Doris 支持了 AI_AGG,一個能夠調用 AI 對文本進行聚合的函數。它讓分析師可以根據自定義指令,從大量的文本中處理特定的任務。
01 使用方法及示例
AI_AGG 詳細用法請參考: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/aggregate-functions/ai-agg 示例 1:
下表模擬一個簡易的客服工單:
SELECT * FROM support_tickets; +-----------+---------------+------------------+----------------------------------------------------------------------------------+ | ticket_id | customer_name | subject | details | +-----------+---------------+------------------+----------------------------------------------------------------------------------+ | 2 | Bob | Login Failure | Same problem as Alice. Also seeing 502 errors on the SSO page. | | 3 | Carol | Payment Declined | Credit card charged twice but order still shows pending. | | 5 | Eve | Login Failure | Getting redirected back to login after entering 2FA code. | | 1 | Alice | Login Failure | Cannot log in after password reset. Tried clearing cache and different browsers. | | 4 | Dave | Slow Dashboard | Dashboard takes >30 seconds to load since the last release. | +-----------+---------------+------------------+----------------------------------------------------------------------------------+ 我們可以通過 AI_AGG 對不同問題類型下客户的問題進行總結
SELECT subject, AI_AGG( details, 'Summarize every ticket detail into one short paragraph' ) AS ai_summary FROM support_tickets GROUP BY subject; 輸出示例如下:
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | subject | ai_summary | +------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Slow Dashboard | The dashboard is experiencing slow loading times, taking over 30 seconds to load following the most recent release. | | Payment Declined | A customer reports being charged twice for their order, which remains in a pending status. | | Login Failure | Users are experiencing login issues, including 2FA redirection, post-password reset failures, and SSO 502 errors, despite clearing cache and trying different browsers. | +------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 示例 2:
下表簡易模擬了應用反饋表:
SELECT * FROM app_feedback; +-------------+----------------+-----------+--------------------------------------------------------------------------------------+ | feedback_id | feature_module | user_name | feedback_text | +-------------+----------------+-----------+--------------------------------------------------------------------------------------+ | 1 | 機票預訂 | 李雷 | 搜索航班非常快,價格也透明,篩選功能很好用。 | | 2 | 機票預訂 | 韓梅梅 | 支付後出票速度有點慢,等了快半小時,希望能改進。 | | 3 | 機票預訂 | 小陳 | 希望能增加更多廉價航空公司的選項。 | | 4 | 機票預訂 | 馬麗 | App在選擇日期的時候偶爾會卡頓,體驗不是很好。 | | 5 | 酒店預訂 | 王先生 | 酒店信息很全,圖片也真實。但是取消預訂的流程太複雜了。 | | 6 | 酒店預訂 | 劉女士 | 通過App預訂比其他平台便宜,還有會員折扣,非常滿意。 | | 7 | 行程規劃 | 趙四 | 行程規劃功能太棒了,可以自動推薦路線和景點,省了不少心。 | | 8 | 客户服務 | 孫小小 | 聯繫客服很方便,問題解決得也很快,給客服點贊。 | | 9 | 客户服務 | 錢多多 | 電話客服總是佔線,在線客服回覆又很慢,希望能增加人手。 | +-------------+----------------+-----------+--------------------------------------------------------------------------------------+ 我們可以通過 AI_AGG 對用户反饋進行總結:
SELECT feature_module AS '功能模塊', AI_AGG( feedback_text, '請用精確且儘可能簡短地總結這些用户反饋中提到的主要問題和建議' ) AS 'AI反饋總結' FROM app_feedback GROUP BY feature_module; 輸出示例如下
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | 功能模塊 | AI反饋總結 | +--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | 客户服務 | 主要問題:客服電話佔線,在線客服回覆慢。
主要建議:增加客服人手。 | | 行程規劃 | 主要問題:無
主要建議:讚賞行程規劃功能,建議繼續優化 | | 酒店預訂 | 主要問題:取消預訂流程複雜。 主要建議:無明確建議,但肯定App預訂價格優勢和會員折扣。 | | 機票預訂 | 主要問題:支付後出票慢(約半小時),App日期選擇卡頓,廉價航空公司選項不足。 主要建議:優化出票速度,修復App卡頓問題,增加廉價航空公司選項。 | +--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 02 AI_AGG 的實現原理
將聚合函數與 AI 結合,需要解決一個分組內文本總量可能遠超模型上下文窗口的問題。如果將所有文本拼接後一次性發送給 AI, 極易出現拼接文本大於模型的最大上下文窗口的情況,而 Doris 中通過動態預聚合解決了這個問題:
02 AI_AGG 的實現原理.png
上下文監控:在聚合拼接文本的過程中,AI_AGG 會為每一個分組維護一個內部的文本緩衝區(目前大小固定為 128K,為絕大部分 AI 可兼容的上下文窗口) 動態預聚合:當新的文本行將要被加入緩衝區並會導致其大小超過閾值時,AI_AGG 會觸發一次預聚合。它會暫停接收新數據,將當前緩衝區內的所有文本作為一個批次發送給 AI 進行一次中間任務處理。 上下文替換:AI 返回的中間處理結果,其長度通常遠小於原始文本。AI_AGG 會用這個精煉後的摘要替換掉緩衝區內原有的長文本,從而為處理更多的新文本行騰出空間。為保證聚合過程的穩定性和防止超出模型服務上限,在替換原有文本後,若加入當前文本行還是會導致緩衝區超過閾值,AI_AGG 會直接報錯退出。 通過這種實現方式,AI_AGG 可以完全融入 Doris 的分佈式查詢計劃,利用多節點並行計算,並由數據庫自動管理聚合過程中的中間狀態。因此,用户可以用熟悉的 SQL 聚合語法,在海量文本上實現高效的智能分析。
EMBED: 文本向量化函數
EMBED 函數的核心功能是通過 AI 將任意文本轉換為高維度的浮點數向量。這個向量是文本在語義空間中的一種數學表示,它捕捉了文本的語義信息。語義相近的文本,其向量在空間中的距離也更近。
EMBED 詳細用法請參考: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/ai-functions/distance-functions/embed 01 使用方法及示例
示例 1:
下表模擬簡易的行為手冊
CREATE TABLE knowledge_base ( id BIGINT, title STRING, content STRING, embedding ARRAY<FLOAT> COMMENT 'Semantic vector generated by EMBED function' ) DUPLICATE KEY(id) DISTRIBUTED BY HASH(id) BUCKETS 4 PROPERTIES ( "replication_num" = "1" );
INSERT INTO knowledge_base (id, title, content, embedding) VALUES (1, "Travel Reimbursement Policy", "Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached.", EMBED("travel reimbursement policy")), (2, "Leave Policy", "Employees must apply for leave in the system in advance. If the leave is longer than three days, approval from the direct manager is required.", EMBED("leave request policy")), (3, "VPN User Guide", "To access the internal network, employees must use VPN. For the first login, download and install the client and configure the certificate.", EMBED("VPN guide intranet access")), (4, "Meeting Room Reservation", "Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified.", EMBED("meeting room booking reservation")), (5, "Procurement Request Process", "Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required.", EMBED("procurement request process finance")); 通過 EMBED 函數對文本的向量化操作,結合 Doris 支持的向量函數, 可對數據進行如下操作:
問答檢索(結合 COSINE_DISTANCE) SELECT id, title, content, COSINE_DISTANCE(embedding, EMBED("How to apply for travel reimbursement?")) AS score FROM knowledge_base ORDER BY score ASC LIMIT 2; +------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------+ | id | title | content | score | +------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------+ | 1 | Travel Reimbursement Policy | Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached. | 0.4463210454563673 | | 5 | Procurement Request Process | Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required. | 0.5726841578491431 | +------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------+ 問題分析匹配(結合 L2_DISTANCE) SELECT id, title, content, L2_DISTANCE(embedding, EMBED("How to access the company intranet")) AS distance FROM knowledge_base ORDER BY distance ASC LIMIT 2; +------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+--------------------+ | id | title | content | distance | +------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+--------------------+ | 3 | VPN User Guide | To access the internal network, employees must use VPN. For the first login, download and install the client and configure the certificate. | 0.5838271122253775 | | 1 | Travel Reimbursement Policy | Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached. | 1.272394695975331 | +------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+--------------------+ 根據文章內容進行文本相關度匹配並推薦(結合INNER PRODUCT) SELECT id, title, content, INNER_PRODUCT(embedding, EMBED("Leave system request leader approval")) AS score FROM knowledge_base WHERE id != 2 ORDER BY score DESC LIMIT 2; +------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------------+ | id | title | content | score | +------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------------+ | 5 | Procurement Request Process | Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required. | 0.33268885332504 | | 4 | Meeting Room Reservation | Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified. | 0.29224032230852487 | +------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------------+ 尋找差異較小的內容(結合L1_DISTANCE) SELECT id, title, content, L1_DISTANCE(embedding, EMBED("Procurement application process")) AS distance FROM knowledge_base ORDER BY distance ASC LIMIT 3; +------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+ | id | title | content | distance | +------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+ | 5 | Procurement Request Process | Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required. | 18.66882028897362 | | 4 | Meeting Room Reservation | Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified. | 30.90449328294426 | | 2 | Leave Policy | Employees must apply for leave in the system in advance. If the leave is longer than three days, approval from the direct manager is required. | 31.060405636536416 | +------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+ EMBED 的設計不僅是一個文本向量化工具,更是 Doris 向量分析生態的重要組成部分。 通過與其他向量函數(COSINE_DISTANCE, L2_DISTANCE, INNER_PRODUCT, L1_DISTANCE )的無縫集成,EMBED 支持用户在 SQL 查詢中完成從文本到向量,再到相似度計算或檢索的一站式分析。
示例 2:
下表模擬簡易的客服常見問題文檔:
CREATE TABLE support_docs ( id BIGINT, title STRING, content STRING, embedding ARRAY<FLOAT> COMMENT '由 EMBED 函數生成的語義向量' ) DUPLICATE KEY(id) DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ( "replication_num" = "1" );
INSERT INTO support_docs (id, title, content, embedding) VALUES (1, "系統登錄問題排查", "當用户無法登錄時,請先檢查網絡連接、瀏覽器緩存,並確認賬號未被鎖定。", EMBED("登錄問題排查 網絡異常 賬號鎖定")), (2, "數據備份與恢復指南", "客户可在管理後台手動備份數據,若誤刪可通過支持團隊申請數據恢復。", EMBED("數據備份 恢復 操作指南")), (3, "賬單與發票問題", "企業客户可在財務模塊下載電子發票。如需紙質版,請提交申請工單。", EMBED("賬單 發票 財務模塊")), (4, "API 接口調用規範", "開發者在調用 API 時需攜帶正確的訪問令牌,否則將返回身份驗證錯誤。", EMBED("接口調用 認證錯誤 訪問令牌")), (5, "服務中斷應急流程", "若出現服務中斷,技術團隊需在30分鐘內啓動應急響應併發布公告。", EMBED("服務中斷 應急響應 處理流程")); 通過 EMBED 函數對文本的向量化操作,結合 Doris 支持的向量函數, 可對數據進行如下操作:
問答檢索(結合 COSINE_DISTANCE) SELECT id, title, content, COSINE_DISTANCE(embedding, EMBED("登錄不上系統怎麼辦?")) AS score FROM support_docs ORDER BY score ASC LIMIT 2; +------+--------------------------+--------------------------------------------------------------------------------------------------------+-----------+ | id | title | content | score | +------+--------------------------+--------------------------------------------------------------------------------------------------------+-----------+ | 1 | 系統登錄問題排查 | 當用户無法登錄時,請先檢查網絡連接、瀏覽器緩存,並確認賬號未被鎖定。 | 0.3183002 | | 4 | API 接口調用規範 | 開發者在調用 API 時需攜帶正確的訪問令牌,否則將返回身份驗證錯誤。 | 0.5599254 | +------+--------------------------+--------------------------------------------------------------------------------------------------------+-----------+ 問題分析匹配(結合 L2_DISTANCE) SELECT id, title, content, L2_DISTANCE(embedding, EMBED("接口調用時提示沒有權限")) AS distance FROM support_docs ORDER BY distance ASC LIMIT 2; +------+--------------------------+--------------------------------------------------------------------------------------------------------+-----------+ | id | title | content | distance | +------+--------------------------+--------------------------------------------------------------------------------------------------------+-----------+ | 4 | API 接口調用規範 | 開發者在調用 API 時需攜帶正確的訪問令牌,否則將返回身份驗證錯誤。 | 0.6471552 | | 1 | 系統登錄問題排查 | 當用户無法登錄時,請先檢查網絡連接、瀏覽器緩存,並確認賬號未被鎖定。 | 0.9831962 | +------+--------------------------+--------------------------------------------------------------------------------------------------------+-----------+ 根據文章內容進行文本相關度匹配並推薦(結合INNER PRODUCT) SELECT id, title, content, INNER_PRODUCT(embedding, EMBED("賬單 發票 報銷流程")) AS score FROM support_docs ORDER BY score DESC LIMIT 2; +------+--------------------------+-----------------------------------------------------------------------------------------------+-----------+ | id | title | content | score | +------+--------------------------+-----------------------------------------------------------------------------------------------+-----------+ | 3 | 賬單與發票問題 | 企業客户可在財務模塊下載電子發票。如需紙質版,請提交申請工單。 | 0.8098868 | | 5 | 服務中斷應急流程 | 若出現服務中斷,技術團隊需在30分鐘內啓動應急響應併發布公告。 | 0.3729638 | +------+--------------------------+-----------------------------------------------------------------------------------------------+-----------+ 尋找差異較小的內容(結合L1_DISTANCE) SELECT id, title, content, L1_DISTANCE(embedding, EMBED("服務中斷 處理 指南")) AS distance FROM support_docs ORDER BY distance ASC LIMIT 3; +------+-----------------------------+--------------------------------------------------------------------------------------------------------+----------+ | id | title | content | distance | +------+-----------------------------+--------------------------------------------------------------------------------------------------------+----------+ | 5 | 服務中斷應急流程 | 若出現服務中斷,技術團隊需在30分鐘內啓動應急響應併發布公告。 | 13.94832 | | 2 | 數據備份與恢復指南 | 客户可在管理後台手動備份數據,若誤刪可通過支持團隊申請數據恢復。 | 24.65827 | | 1 | 系統登錄問題排查 | 當用户無法登錄時,請先檢查網絡連接、瀏覽器緩存,並確認賬號未被鎖定。 | 24.9747 | +------+-----------------------------+--------------------------------------------------------------------------------------------------------+----------+ 02 靈活的向量維度控制
通過 Doris 內置的 RESOURCE 機制,用户在配置 AI Resource 時,可以設置 ai.dimensions 參數來精確指定生成向量的維度。用户可以根據具體的業務場景和性能考量,選擇生成高維向量以保留更豐富的語義信息,或選擇低維向量以節約存儲空間和加速計算。這種靈活性使得 EMBED 函數能夠更好地適應從輕量級語義匹配到高精度向量檢索等多樣化的分析需求,讓用户在成本與效果之間找到最佳平衡。
注意:在使用 dimensions 參數時,請務必確認 RESOURCE 中配置的模型支持您所指定的維度,否則可能導致請求錯誤。此外,Doris 內部對部分不支持維度定製的模型(例如 OpenAI 的 text-embedding-ada-002)做了限制。對於這些模型,即使在 RESOURCE 中設置了 dimensions 參數,該設置也將被忽略,函數將返回模型默認的維度。 總結與展望
藉助AI_AGG 與 EMBED 函數,Apache Doris 擁有了了強大的智能分析能力,極大地拓展了數據分析與智能應用的邊界。AI_AGG 通過其動態預聚合機制,將非結構化文本的智能聚合分析帶入數據庫,輕鬆應對海量用户評論、日誌分析等場景。而 EMBED 函數則與 Doris 的向量函數無縫集成,提供從文本到語義向量、再到相似度檢索的一站式解決方案,極大簡化了問答系統、內容推薦等應用的開發。這些功能使得 SQL 語言本身具備了駕馭 AI 模型的能力,讓每一位數據分析師都能以低成本、高效率的方式,挖掘數據中更深層次的語義價值。
展望未來,Doris 將繼續深化 AI 與數據庫的融合。我們將致力於優化模型調度與計算性能,並探索更多如多模態數據分析、AI Agent 交互等前沿功能,持續降低 AI 技術的使用門檻,讓數據驅動的智能決策無處不在。