Shreya Shankar
University of California, Berkeley
shreyashankar@berkeley.edu
Bio
Shreya Shankar is a final-year PhD student in the Data Systems and Foundations group at UC Berkeley. She builds systems that make AI-powered data processing reliable and cost-efficient, and her work spans both data systems research and its real-world application. Her papers appear in leading data systems venues such as SIGMOD and VLDB, as well as in venues in other areas of computer science, such as UIST, CSCW, and NAACL. Her systems have been adopted in the tech industry by open-source AI tools like LangChain and ChromaDB, and outside tech by organizations such as public defender offices and government agencies. Beyond her research, Shreya co-teaches AI Evals for Engineers and PMs, an industry-focused course on evaluating and testing AI applications that has reached more than 2,000 participants across more than 500 companies. Before starting her PhD, Shreya worked as an engineer and earned her undergraduate degree in computer science from Stanford University.
Areas of Research
- Computer Systems
Agentic Query Optimization for Unstructured Data Processing
Organizations increasingly want to process large corpora of unstructured documents with AI, not just to digitize them, but to extract, reason over, and analyze their contents. In practice, however, AI-based approaches are often inaccurate, prohibitively expensive, or both. I built DocETL, an open-source declarative system for unstructured data processing. DocETL introduces the concept of agentic query optimization: we define rewrite directives that guide LLM agents through logically equivalent query plans, and a search algorithm that explores these plans to improve accuracy while keeping execution within cost budgets. As adoption grew, two main challenges emerged: usability and cost. To address usability, I built DocWrangler, an IDE that helps users understand their data and iteratively refine their DocETL pipelines. To address cost, I developed rewriting techniques that reduce the expense of common DocETL operations (filters and classifiers) while preserving, with guarantees, the accuracy of the original pipelines. As of September 2025, the DocETL stack has over 3,000 GitHub stars, powers document analysis for public defenders in two major California counties, and supports thousands of unstructured data pipelines across domains such as healthcare, finance, and climate science.