Efficient Expert Witness Profiling with Ray: Automated Key Phrase Extraction from Legal Documents

The legal industry faces a challenge in efficiently finding relevant information in millions of caselaw documents. One area where lawyers often struggle is determining the expertise of expert witnesses and identifying the specific topics and issues they can comment on. We have developed a solution that extracts key phrases and then identifies common facts and subjects from a large corpus of expert-associated caselaw documents, such as depositions, opinions, CVs, reports, and jury verdicts.

Our solution pipeline has two major parts: general data pre-processing on AWS EMR using Spark, and last-mile processing using Python on AWS SageMaker. In our proof-of-concept (POC) study, we experimented with Ray as an alternative to AWS SageMaker for the last-mile processing, which involves feature generation, unsupervised learning, and natural language processing (NLP) techniques.

As an example, consider legal opinions. These tell the story of the case: what the case is about, how the court is resolving the case, and why. For this use case, most of the text is not needed. We want to discard it and extract only the key phrases that capture what an expert has said on a particular subject matter or issue.

In the first component of our solution pipeline, key phrase extraction, we deployed an algorithm that ranks phrases and removes unwanted ones. We used the spaCy library for NLP-based pre-processing and the Ray Dataset API to speed up processing. We saw a 5x reduction in the time needed to rank and filter unwanted phrases.
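The rank-and-filter step described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual algorithm: the scoring rule (share of non-stop-word tokens), the `STOPWORDS` list, the thresholds, and the function names are all assumptions made for the example, and a plain regex tokenizer stands in for the spaCy pre-processing. Only `ray.data.from_items`, `map`, `filter`, `sort`, and `take_all` are real Ray Data calls.

```python
import re

# Hypothetical stop-word list; a real pipeline would likely rely on spaCy's.
STOPWORDS = {"the", "of", "a", "an", "in", "on", "and", "to", "is", "was", "that", "for"}


def score_phrase(row: dict) -> dict:
    """Attach a simple score: the fraction of tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", row["phrase"].lower())
    content = [t for t in tokens if t not in STOPWORDS]
    score = len(content) / len(tokens) if tokens else 0.0
    return {"phrase": row["phrase"], "score": score, "n_content": len(content)}


def keep_phrase(row: dict, min_score: float = 0.5, min_content: int = 2) -> bool:
    """Drop phrases that are mostly stop words or carry too few content tokens."""
    return row["score"] >= min_score and row["n_content"] >= min_content


def rank_with_ray(phrases: list[str]) -> list[dict]:
    """Score, filter, and rank phrases in parallel with Ray Data."""
    import ray.data  # imported lazily so the helpers above work without Ray

    ds = ray.data.from_items([{"phrase": p} for p in phrases])
    ds = ds.map(score_phrase).filter(keep_phrase).sort("score", descending=True)
    return ds.take_all()
```

Because `score_phrase` and `keep_phrase` are ordinary per-row functions, they can be checked locally before being handed to Ray, which then distributes the same functions across the dataset's blocks.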

Using a pre-trained language model, we further removed phrases that were not relevant in helping end users understand 'what an Expert Witness is an expert in' or 'what exactly they can comment on'. We achieved a 24x reduction in processing time to filter unwanted phrases using Ray Dataset and ActorPoolStrategy.
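A rough sketch of how model-based filtering with Ray Dataset and `ActorPoolStrategy` might look is given below. The abstract does not name the language model, so the `RelevanceFilter` class, its keyword set, and the `filter_with_actor_pool` helper are hypothetical stand-ins. The pattern being illustrated is that a callable class passed to `map_batches` is constructed once per actor, so an expensive model load is paid once per actor rather than once per batch.

```python
import numpy as np


class RelevanceFilter:
    """Callable class used as a Ray Data UDF; one instance lives in each actor."""

    def __init__(self):
        # Stand-in for loading a pre-trained language model (assumption).
        # A real model would be loaded here, once per actor.
        self.topic_terms = {"injury", "engineering", "toxicology", "forensic", "valuation"}

    def is_relevant(self, phrase: str) -> bool:
        """Placeholder relevance check; a real model would score the phrase."""
        return any(tok in self.topic_terms for tok in phrase.lower().split())

    def __call__(self, batch: dict) -> dict:
        """Keep only the phrases in this batch that are judged relevant."""
        kept = [str(p) for p in batch["phrase"] if self.is_relevant(str(p))]
        return {"phrase": np.array(kept, dtype=object)}


def filter_with_actor_pool(phrases: list[str], num_actors: int = 2) -> list[str]:
    """Run RelevanceFilter over the phrases on a fixed-size pool of Ray actors."""
    import ray.data  # imported lazily so RelevanceFilter can be tested without Ray

    ds = ray.data.from_items([{"phrase": p} for p in phrases])
    ds = ds.map_batches(
        RelevanceFilter,
        compute=ray.data.ActorPoolStrategy(size=num_actors),
        batch_size=64,
    )
    return [row["phrase"] for row in ds.take_all()]
```

The pool size and batch size here are arbitrary. How the compute strategy is specified has changed across Ray releases, so the `map_batches` documentation for the installed version is worth checking.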

Finally, using an unsupervised learning algorithm, we grouped the remaining phrases into sets of similar key phrases. To improve quality, we removed low-value key phrases and optimized the number of facts and subjects for each expert, achieving an 11x reduction in compute time using Ray.
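To illustrate the grouping step, here is a small sketch that clusters similar phrases per expert and fans the work out as Ray tasks. The abstract does not say which unsupervised algorithm was used, so the greedy Jaccard-similarity grouping below is a swapped-in stand-in, and the function names, the 0.5 threshold, and the one-task-per-expert layout are assumptions for the example.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_phrases(phrases: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy grouping: a phrase joins the first group whose representative
    (the group's first phrase) is similar enough, else it starts a new group."""
    groups = []  # each entry: (token set of the representative, member phrases)
    for phrase in phrases:
        tokens = set(phrase.lower().split())
        for rep_tokens, members in groups:
            if jaccard(tokens, rep_tokens) >= threshold:
                members.append(phrase)
                break
        else:
            groups.append((tokens, [phrase]))
    return [members for _, members in groups]


def group_all_experts(phrases_by_expert: dict) -> dict:
    """Group each expert's phrases in a separate Ray task (hypothetical layout)."""
    import ray  # imported lazily so group_phrases can be tested without Ray

    remote_group = ray.remote(group_phrases)
    futures = {name: remote_group.remote(p) for name, p in phrases_by_expert.items()}
    return {name: ray.get(ref) for name, ref in futures.items()}
```

Since each expert's phrases are grouped independently, the per-expert calls have no shared state, which is what makes them straightforward to run as parallel tasks.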

Our proof-of-concept (POC) study with Ray was promising and resulted in a faster and more efficient way of identifying relevant facts and subjects. Over the next few months, we plan to scale out the pipeline on a multi-node Ray cluster. We would be happy to share our insights on using Ray in production at the conference in September.

About Anyscale
---
Anyscale is the AI Application Platform for developing, running, and scaling AI.

If you're interested in a managed Ray service, check out:

About Ray
---
Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads.

#llm #machinelearning #ray #deeplearning #distributedsystems #python #genai