
Commit 7fce09e

chore: update confs

1 parent 9c413c5 commit 7fce09e
File tree

1 file changed: +35 -0 lines changed


arxiv.json

Lines changed: 35 additions & 0 deletions
@@ -37056,5 +37056,40 @@
 "pub_date": "2024-12-16",
 "summary": "Despite their remarkable success, large language models (LLMs) have shown limited ability on applied tasks such as vulnerability detection. We investigate various prompting strategies for vulnerability detection and, as part of this exploration, propose a prompting strategy that integrates natural language descriptions of vulnerabilities with a contrastive chain-of-thought reasoning approach, augmented using contrastive samples from a synthetic dataset. Our study highlights the potential of LLMs to detect vulnerabilities by integrating natural language descriptions, contrastive reasoning, and synthetic examples into a comprehensive prompting framework. Our results show that this approach can enhance LLM understanding of vulnerabilities. On a high-quality vulnerability detection dataset such as SVEN, our prompting strategies can improve accuracies, F1-scores, and pairwise accuracies by 23%, 11%, and 14%, respectively.",
 "translated": "尽管大型语言模型(LLMs)在许多领域取得了显著的成功,但在诸如漏洞检测等应用任务上,其表现仍然有限。我们研究了多种用于漏洞检测的提示策略,并在这一探索过程中,提出了一种将漏洞的自然语言描述与对比性思维链推理方法相结合的提示策略,该方法通过使用合成数据集中的对比样本进行增强。我们的研究表明,通过将自然语言描述、对比性推理和合成示例整合到一个综合的提示框架中,LLMs在检测漏洞方面具有潜力。我们的实验结果显示,这种方法能够增强LLM对漏洞的理解。在如SVEN这样的高质量漏洞检测数据集上,我们的提示策略能够分别将准确率、F1分数和成对准确率提高23%、11%和14%。"
+},
+{
+"title": "Re-calibrating methodologies in social media research: Challenge the\n visual, work with Speech",
+"url": "http://arxiv.org/abs/2412.13170v1",
+"pub_date": "2024-12-17",
+"summary": "This article methodologically reflects on how social media scholars can effectively engage with speech-based data in their analyses. While contemporary media studies have embraced textual, visual, and relational data, the aural dimension remained comparatively under-explored. Building on the notion of secondary orality and rejection towards purely visual culture, the paper argues that considering voice and speech at scale enriches our understanding of multimodal digital content. The paper presents the TikTok Subtitles Toolkit that offers accessible speech processing readily compatible with existing workflows. In doing so, it opens new avenues for large-scale inquiries that blend quantitative insights with qualitative precision. Two illustrative cases highlight both opportunities and limitations of speech research: while genres like #storytime on TikTok benefit from the exploration of spoken narratives, nonverbal or music-driven content may not yield significant insights using speech data. The article encourages researchers to integrate aural exploration thoughtfully to complement existing methods, rather than replacing them. I conclude that the expansion of our methodological repertoire enables richer interpretations of platformised content, and our capacity to unpack digital cultures as they become increasingly multimodal.",
+"translated": "本文从方法论角度探讨了社交媒体研究者如何在其分析中有效利用基于语音的数据。尽管当代媒介研究已广泛接纳文本、视觉和关系型数据,但听觉维度相对而言仍未得到充分探索。基于次级口语性概念和对纯视觉文化的排斥,本文主张在研究中大规模考虑语音和口语能丰富我们对多模态数字内容的理解。文中介绍了TikTok字幕工具包,该工具包提供了易于获取的语音处理功能,并能与现有工作流程无缝兼容。通过这种方式,它为将定量洞察与定性精度相结合的大规模研究开辟了新的途径。两个案例研究分别展示了语音研究的机会与局限:例如,TikTok上的#storytime类型视频通过探索口头叙述受益匪浅,而以非语言或音乐为主导的内容可能无法通过语音数据获得显著的洞察。本文鼓励研究者在现有方法的基础上,审慎地整合听觉探索,而非替代现有方法。结论指出,方法论工具箱的扩展使我们能够更深入地解读平台化内容,并增强我们在数字文化日益多模态化背景下解构这些文化的能力。"
+},
+{
+"title": "C-FedRAG: A Confidential Federated Retrieval-Augmented Generation System",
+"url": "http://arxiv.org/abs/2412.13163v1",
+"pub_date": "2024-12-17",
+"summary": "Organizations seeking to utilize Large Language Models (LLMs) for knowledge querying and analysis often encounter challenges in maintaining an LLM fine-tuned on targeted, up-to-date information that keeps answers relevant and grounded. Retrieval Augmented Generation (RAG) has quickly become a feasible solution for organizations looking to overcome the challenges of maintaining proprietary models and to help reduce LLM hallucinations in their query responses. However, RAG comes with its own issues regarding scaling data pipelines across tiered-access and disparate data sources. In many scenarios, it is necessary to query beyond a single data silo to provide richer and more relevant context for an LLM. Analyzing data sources within and across organizational trust boundaries is often limited by complex data-sharing policies that prohibit centralized data storage, therefore, inhibit the fast and effective setup and scaling of RAG solutions. In this paper, we introduce Confidential Computing (CC) techniques as a solution for secure Federated Retrieval Augmented Generation (FedRAG). Our proposed Confidential FedRAG system (C-FedRAG) enables secure connection and scaling of a RAG workflows across a decentralized network of data providers by ensuring context confidentiality. We also demonstrate how to implement a C-FedRAG system using the NVIDIA FLARE SDK and assess its performance using the MedRAG toolkit and MIRAGE benchmarking dataset.",
+"translated": "组织在利用大型语言模型(LLMs)进行知识查询和分析时,常常面临如何维护一个针对最新信息进行微调的LLM的问题,以确保其回答的准确性和可靠性。检索增强生成(RAG)迅速成为解决这一问题的可行方案,帮助组织克服维护专有模型的挑战,并减少LLM在查询响应中的幻觉现象。然而,RAG在扩展数据管道以跨越不同访问层级和数据源时也带来了自身的难题。在许多情况下,为了给LLM提供更丰富和更相关的内容,必须从多个数据孤岛中进行查询。由于复杂的数据共享政策限制了集中式数据存储,组织内部和跨组织信任边界的数据源分析往往受限,这阻碍了RAG解决方案的快速有效设置和扩展。本文中,我们引入机密计算(CC)技术作为安全联合检索增强生成(FedRAG)的解决方案。我们提出的机密FedRAG系统(C-FedRAG)通过确保上下文机密性,实现了RAG工作流在分散数据提供者网络中的安全连接和扩展。我们还展示了如何使用NVIDIA FLARE SDK实现C-FedRAG系统,并使用MedRAG工具包和MIRAGE基准数据集评估其性能。"
+},
+{
+"title": "AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark",
+"url": "http://arxiv.org/abs/2412.13102v1",
+"pub_date": "2024-12-17",
+"summary": "Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated. The testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous. The testing data in AIR-Bench is generated with respect to diverse tasks, domains and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.",
+"translated": "评估在信息检索(IR)模型的进步中扮演着至关重要的角色。然而,当前基于预定义领域和人工标注数据的基准测试在应对新兴领域的评估需求时,面临着成本效益和效率方面的局限。为解决这一挑战,我们提出了自动化异构信息检索基准(AIR-Bench)。AIR-Bench 具有三个关键特性:1)自动化。AIR-Bench 中的测试数据由大型语言模型(LLMs)自动生成,无需人工干预。2)异构性。AIR-Bench 中的测试数据针对多种任务、领域和语言生成。3)动态性。AIR-Bench 涵盖的领域和语言不断扩展,为社区开发者提供日益全面的评估基准。我们开发了一个可靠且稳健的数据生成管道,基于真实世界语料库自动创建多样化且高质量的评估数据集。研究结果表明,AIR-Bench 中生成的测试数据与人工标注的测试数据高度一致,使得 AIR-Bench 成为评估 IR 模型的可靠基准。AIR-Bench 的资源已在 https://github.com/AIR-Bench/AIR-Bench 公开发布。"
+},
+{
+"title": "CLASP: Contrastive Language-Speech Pretraining for Multilingual\n Multimodal Information Retrieval",
+"url": "http://arxiv.org/abs/2412.13071v1",
+"pub_date": "2024-12-17",
+"summary": "This study introduces CLASP (Contrastive Language-Speech Pretraining), a multilingual, multimodal representation tailored for audio-text information retrieval. CLASP leverages the synergy between spoken content and textual data. During training, we utilize our newly introduced speech-text dataset, which encompasses 15 diverse categories ranging from fiction to religion. CLASP's audio component integrates audio spectrograms with a pre-trained self-supervised speech model, while its language encoding counterpart employs a sentence encoder pre-trained on over 100 languages. This unified lightweight model bridges the gap between various modalities and languages, enhancing its effectiveness in handling and retrieving multilingual and multimodal data. Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and meanR metrics, outperforming traditional ASR-based retrieval approaches in specific scenarios.",
+"translated": "本研究引入了CLASP(对比语言-语音预训练),这是一种专为音频-文本信息检索设计的多语言、多模态表示方法。CLASP利用了语音内容与文本数据之间的协同效应。在训练过程中,我们使用了新引入的语音-文本数据集,该数据集涵盖了从虚构到宗教等15个不同类别。CLASP的音频部分结合了音频频谱图与一个预训练的自监督语音模型,而其语言编码部分则采用了一个在超过100种语言上预训练的句子编码器。这种统一的轻量级模型弥合了不同模态和语言之间的差距,增强了其在处理和检索多语言和多模态数据方面的有效性。我们在多种语言上的评估表明,CLASP在HITS@1、MRR和meanR指标上设立了新的基准,在特定场景下优于传统的基于ASR的检索方法。"
+},
+{
+"title": "Enabling Low-Resource Language Retrieval: Establishing Baselines for\n Urdu MS MARCO",
+"url": "http://arxiv.org/abs/2412.12997v1",
+"pub_date": "2024-12-17",
+"summary": "As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity, addressing the needs of low-resource languages remains a significant challenge. This paper introduces the first large-scale Urdu IR dataset, created by translating the MS MARCO dataset through machine translation. We establish baseline results through zero-shot learning for IR in Urdu and subsequently apply the mMARCO multilingual IR methodology to this newly translated dataset. Our findings demonstrate that the fine-tuned model (Urdu-mT5-mMARCO) achieves a Mean Reciprocal Rank (MRR@10) of 0.247 and a Recall@10 of 0.439, representing significant improvements over zero-shot results and showing the potential for expanding IR access for Urdu speakers. By bridging access gaps for speakers of low-resource languages, this work not only advances multilingual IR research but also emphasizes the ethical and societal importance of inclusive IR technologies. This work provides valuable insights into the challenges and solutions for improving language representation and lays the groundwork for future research, especially in South Asian languages, which can benefit from the adaptable methods used in this study.",
+"translated": "随着信息检索(IR)领域逐渐认识到包容性的重要性,满足低资源语言需求的问题仍然是一个重大挑战。本文首次引入了一个大规模的乌尔都语IR数据集,该数据集通过机器翻译将MS MARCO数据集翻译而成。我们通过在乌尔都语中进行零样本学习的IR方法建立了基线结果,随后将mMARCO多语言IR方法应用于这个新翻译的数据集。研究结果表明,经过微调的模型(Urdu-mT5-mMARCO)达到了平均倒数排名(MRR@10)为0.247和召回率@10为0.439,相较于零样本结果有显著提升,展示了扩展乌尔都语使用者信息检索访问的潜力。通过弥合低资源语言使用者的访问差距,这项工作不仅推动了多语言信息检索研究,还强调了包容性信息检索技术在伦理和社会层面的重要性。本研究为改善语言表达的挑战和解决方案提供了宝贵的见解,并为未来的研究奠定了基础,尤其是在可以从本研究中使用的适应性方法中受益的南亚语言方面。"
 }
 ]
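The diff above appends five entry objects to a JSON array. A minimal sanity-check sketch for such a payload (the field names come from the diff itself; the `validate_entries` helper is a hypothetical illustration, not part of this repository):

```python
import json

# Fields each entry in the diff carries.
REQUIRED_FIELDS = {"title", "url", "pub_date", "summary", "translated"}

def validate_entries(raw: str) -> int:
    """Parse an arxiv.json-style payload and verify that every entry
    has the expected fields. Returns the number of entries."""
    entries = json.loads(raw)
    for i, entry in enumerate(entries):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} missing fields: {sorted(missing)}")
    return len(entries)

# Usage: one well-formed entry modeled on the diff.
sample = json.dumps([{
    "title": "C-FedRAG: A Confidential Federated Retrieval-Augmented Generation System",
    "url": "http://arxiv.org/abs/2412.13163v1",
    "pub_date": "2024-12-17",
    "summary": "...",
    "translated": "...",
}])
print(validate_entries(sample))  # → 1
```

Running a check like this before committing would catch a truncated append (the commit's +35/-0 shape means a single missing brace or comma invalidates the whole file).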

0 commit comments
