2024 Metcalf Summer Research Assistant
This internship is part of the University’s Jeff Metcalf Internship Program. Please review the program information to learn more about benefits and program requirements for Metcalf interns.
If you are selected for an interview, please tell your prospective host organization/employer where you will be physically located during the internship, as your location may affect your (or your host organization/employer's) ability to pursue this opportunity.
If you are an international student, please make sure to visit the OIA website to familiarize yourself with your work authorization eligibility and requirements as soon as possible. If you’d like to make an appointment with your international adviser, please visit this page.
Fellowship Award Amount: $4,000
Internship Time Commitment: 8+ weeks, Full-Time (320+ Hours)
Center for Translational Data Science:
The Center for Translational Data Science at the University of Chicago is developing the discipline of data science and its applications to problems in biology, medicine, healthcare and the environment. We develop and operate large scale data platforms to support research in topics of societal interest including cancer, cardiovascular disease, inflammatory bowel disease (IBD), birth defects, veterans’ health, pain management, opioid use disorder, and environmental science. We also develop new machine learning and AI algorithms over the data in our platforms.
Multiple projects are available with the Center for Translational Data Science for summer 2024. These projects are detailed below.
Please indicate a project or projects of interest in your cover letter.
Project #1: Gen3 Data Lakehouse
Mentor(s): Alex VanTol, Michael Lukowski
Project Description
Gen3 is an open source data platform for managing, analyzing and sharing biomedical data. The current design requires substantial configuration to establish a data model with required metadata fields, which can be very time-consuming for a new data commons operator to set up. We want to prototype a new approach where files can be easily distributed with minimal metadata; we will update some of the data ingestion and storage methods in Gen3 to adopt additional data lake features. This project offers the opportunity to learn in depth about graph models, converting data into serialized Avro objects, and querying Avro objects.
We want to create a generalized method and tooling to convert data in a data warehouse (in a graph model) into an Avro-based serialized file for ingestion into a data lake. We’ve developed an Avro-based file format called the Portable Format for Bioinformatics (PFB), which is used for exporting and importing bulk clinical and other structured data from one component of a data platform for health care data to another component. The work necessary in this project is designing the methodology and tooling to support the creation of these PFB files from arbitrary graph models (ideally converting from other common representations of such data). If time allows, such client-side tooling can also include querying and visualization tools to see the graph model contained within a PFB and/or aid in the creation of new models for ingesting data.
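To make the conversion concrete, here is a minimal sketch of flattening a small graph model into one record per node, with edges stored as relations on the source node. The record layout (`id`, `name`, `object`, `relations`), the function names, and the toy data are assumptions chosen for illustration and are not taken from the PFB specification. The sketch writes JSON lines with the Python standard library in place of Avro so that it runs without extra packages; a real tool would hand the same records to an Avro library.

```python
import json
from collections import defaultdict


def graph_to_records(nodes, edges):
    """Flatten a graph into one self-describing record per node.

    nodes: dict mapping node id -> (node type, properties dict)
    edges: list of (source id, destination id) pairs
    The record layout is an illustrative assumption, not the PFB schema.
    """
    outgoing = defaultdict(list)
    for src, dst in edges:
        if src not in nodes or dst not in nodes:
            raise ValueError(f"edge ({src!r}, {dst!r}) references an unknown node")
        # Store each edge on its source node, pointing at the destination.
        outgoing[src].append({"dst_id": dst, "dst_name": nodes[dst][0]})
    return [
        {
            "id": node_id,
            "name": node_type,
            "object": props,
            "relations": outgoing[node_id],
        }
        for node_id, (node_type, props) in nodes.items()
    ]


def to_json_lines(records):
    """Serialize records one per line (a stand-in for an Avro writer)."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


# Hypothetical toy graph: a sample belongs to a case, which belongs to a study.
nodes = {
    "study-1": ("study", {"title": "Example study"}),
    "case-1": ("case", {"age": 54}),
    "sample-1": ("sample", {"tissue": "thyroid"}),
}
edges = [("case-1", "study-1"), ("sample-1", "case-1")]

records = graph_to_records(nodes, edges)
serialized = to_json_lines(records)
```

Keeping relations inside each record means the file can be read one record at a time without loading the whole graph, which is one plausible reason to prefer a flat serialized layout for bulk export.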
Expected Outcomes
- Extensible methodology and tooling for creating a serialized file from a graph model and related data
Skills Required/Preferred
- Required:
  - Python programming experience
- Preferred:
  - Knowledge of graph models
  - Some familiarity with serialization formats (specifically, Avro)
  - Knowledge of UX and/or client side tooling
Expected Size and Difficulty
- 175 hours
- Difficulty - Medium
Project #2: Develop AI search tool for exploring published cancer research
Mentor(s): Aarti Venkat, Lauren Mogil, Alexander VanTol
Project Description
Petabytes of data have been produced from various cancer projects, but there is currently no easy way for a researcher to search and discover datasets jointly from all these projects. For example: I’m a researcher interested in thyroid cancer. I’d like to know how much data is available for thyroid cancer to date, and how I can access it. Large language models (LLMs) are a powerful resource for searching and discovering over project-level metadata and can be exploited for this purpose. We are working on compiling a list of all published cancer initiatives (including projects, repositories, data commons, databases, and data meshes) and associated project-level metadata. The new model will search across this collection of data.
At CTDS, we have recently implemented a simple retrieval-augmented generation (RAG) AI model called Gen3 Discovery AI that lets researchers search and discover over arbitrary data within a Gen3 data commons or mesh. It will have a dedicated front end where researchers ask simple questions and the model returns associated datasets. The goal of this project is to implement a similar model to search and discover over studies in cancer. We plan to use our compiled list of cancer initiatives and openAPIs to bring in study-level metadata and create a metadata store for the RAG model. Initial approaches could use Google Vertex AI’s PaLM models (which are the default in the new Gen3 service) and later be optimized toward more custom models.
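The retrieval half of a RAG pipeline can be sketched in a few lines. The example below is only an illustration under stated assumptions: it ranks study descriptions by bag-of-words cosine similarity using the Python standard library, whereas a real service would likely use learned embeddings and an LLM. The study names, descriptions, and function names are hypothetical, and nothing here reflects how Gen3 Discovery AI is actually implemented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, studies, top_k=2):
    """Return the names of the top_k studies most similar to the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(study["description"]))), study["name"])
        for study in studies
    ]
    # Drop studies with no overlap, then sort by score (ties broken by name).
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


def build_prompt(query, studies, hits):
    """Assemble the context an LLM would receive in the generation step."""
    by_name = {study["name"]: study["description"] for study in studies}
    context = "\n".join(f"- {name}: {by_name[name]}" for name in hits)
    return f"Answer using only these studies:\n{context}\n\nQuestion: {query}"


# Hypothetical study-level metadata store.
studies = [
    {"name": "study-a", "description": "Whole genome sequencing of thyroid cancer tumors"},
    {"name": "study-b", "description": "Imaging data for pediatric brain tumors"},
    {"name": "study-c", "description": "Clinical outcomes in thyroid cancer patients"},
]

query = "How much thyroid cancer data is available?"
hits = retrieve(query, studies)
prompt = build_prompt(query, studies, hits)
```

In this toy store the two thyroid cancer studies rank above the imaging study, and only their descriptions are passed on as context for the generation step.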
Expected Outcomes
- Configuring and loading data in Gen3 Discovery AI for cancer research
- A simple, working search and retrieval strategy (e.g., using Google Vertex AI methods)
- A simple front end that accepts query prompts and returns a list of relevant studies
Skills Required/Preferred
- Required:
  - Python
  - AI/ML experience
- Preferred:
  - Large language models experience
  - openAI search and discovery tools
Expected Size and Difficulty
- 350 hours
- Difficulty - Medium
Project #3: Incorporate Support for Frictionless into Gen3
Mentor(s): Phil Schumm, Michael Lukowski
Project Description
Gen3 is an open source framework for building data commons, meshes and fabrics, and is developed and maintained by the Center for Translational Data Science. Gen3 platforms can store and distribute structured (e.g., clinical) data via a flexible graph model, semistructured data and metadata via a key-value store, and data files stored in several popular cloud services (e.g., GCP, AWS and Azure).
Frictionless (https://frictionlessdata.io/) is an open-source toolkit implementing a set of open standards for packaging data and metadata, designed to be easy to use and to handle data at scale (e.g., streaming). Frictionless data packages are portable, extensible and software agnostic, and are used by a wide variety of research, commercial and governmental organizations across a broad range of scientific disciplines. Frictionless could provide a powerful solution to simplify moving data into and out of a Gen3 data commons, as well as facilitating analysis of those data.
The primary purpose of this project will be to develop and implement a flexible and robust mechanism for translating data and metadata from Portable Format for Bioinformatics (PFB), an efficient Apache Avro-based format native to Gen3, to a Frictionless data package. For a specific study, this might include information about the study, structured data such as from a medical record, and one or more data files such as an MRI. Bundling these data into a standalone, versioned data package, which can be used with a wide variety of analytic software, will substantially increase the scientific value and ease of use of data stored in Gen3 platforms.
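As a rough illustration of the target side of this translation, the sketch below builds a Data Package descriptor (the content of a `datapackage.json` file) and CSV text from tabular records using only the Python standard library. It assumes the records have already been read out of a PFB file into plain dictionaries; the table name, field names, and helper functions are hypothetical, and a real implementation would more likely use the Frictionless toolkit itself.

```python
import csv
import io
import json

# Map Python types to Table Schema field types; anything else becomes "any".
TYPE_NAMES = {bool: "boolean", int: "integer", float: "number", str: "string"}


def infer_fields(rows):
    """Infer Table Schema fields from the first row (a simplification)."""
    return [
        {"name": key, "type": TYPE_NAMES.get(type(value), "any")}
        for key, value in rows[0].items()
    ]


def rows_to_csv(rows):
    """Render rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def build_package(name, tables):
    """Build a descriptor with one CSV resource per table."""
    resources = [
        {
            "name": table_name,
            "path": f"{table_name}.csv",
            "format": "csv",
            "schema": {"fields": infer_fields(rows)},
        }
        for table_name, rows in tables.items()
    ]
    return {"name": name, "resources": resources}


# Hypothetical records, standing in for data extracted from a PFB file.
tables = {
    "case": [
        {"case_id": "case-1", "age": 54},
        {"case_id": "case-2", "age": 61},
    ]
}

package = build_package("example-study", tables)
descriptor_json = json.dumps(package, indent=2)
case_csv = rows_to_csv(tables["case"])
```

Writing `descriptor_json` to `datapackage.json` next to `case.csv` would give a directory laid out as a data package, with field names and types recorded alongside the data.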
Expected Outcomes
- A robust and flexible mechanism for translating between PFB files and Frictionless data packages (i.e., in both directions). It is anticipated that this mechanism will be incorporated as a component within Gen3 (i.e., as opposed to a stand-alone piece of software), permitting users to retrieve Frictionless data packages for use locally or within a Gen3 workspace.
- Functionality within Gen3 (i.e., a download button in the GUI and API support for programmatic use) to download all (or selected) data from a study in the form of a Frictionless data package.
Skills Required/Preferred
- Required:
  - Python programming experience, ideally in writing software to manipulate data
- Preferred:
  - Familiarity with Gen3 and/or Frictionless
  - Familiarity with an established, open-source project for manipulating data (e.g., NumPy, pandas, Polars)
  - Experience working with biomedical research data
Expected Size and Difficulty
- 350 hours
- Difficulty - Medium