Below is a list of project ideas. Feel free to contact the listed mentors on Slack.
2. [Benchmarking and Validation Framework](#validation)
3. [Increase PEcAn modularity](#module)
4. [Standardizing Model Couplers Across Models](#couplertools)
5. [LLM-Assisted Extraction of Agronomic Experiments into BETYdb](#llm-betydb)

---

Medium (175hr) or Large (350 hr) depending on number of deliverables

**Difficulty:**
Medium

---
### 5. LLM-Assisted Extraction of Agronomic Experiments into BETYdb{#llm-betydb}

Manual extraction of agronomic and ecological experiments from scientific literature into BETYdb is slow, error-prone, and labor-intensive. Researchers must interpret complex experimental designs, reconstruct management timelines, identify treatments and controls, handle factorial structures, and link outcomes with correct covariates and uncertainty estimates—tasks that require scientific judgment beyond simple text extraction. Current manual workflows can take hours per paper and introduce inconsistencies that compromise downstream data quality and meta-analyses.

This project proposes a human-supervised, LLM-based system to accelerate BETYdb data entry while preserving scientific rigor and traceability. The system will ingest PDFs of scientific papers and produce upload-ready BETYdb entries (sites, treatments, management time series, traits, and yields) with every field labeled as extracted, inferred, or unresolved and linked to provenance evidence in the source document. The system leverages existing labeled training data (scientific papers with ground-truth BETYdb entries).

The architecture follows a two-layer design: (1) a schema-validated intermediate representation (IR) preserving evidence links, confidence scores, and flagged conflicts, and (2) a BETYdb materialization layer that enforces BETYdb semantics, validation rules, and generates upload-ready CSVs or API payloads with full audit trails. Implementation is flexible—ranging from agentic LLM workflows to fine-tuned specialist models to an adaptive hybrid—and should be informed by empirical evaluation during the project.
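The IR layer described above can be sketched as follows. This is a minimal illustration, not an existing BETYdb or PEcAn API: the `IRField` record, its field names (`status`, `evidence`, `confidence`), and the validation rules are all hypothetical choices for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of one field in the intermediate representation (IR).
# Every value carries its extraction status and a link back to the evidence
# in the source PDF, so a reviewer can audit each entry before upload.
@dataclass
class IRField:
    name: str                       # e.g. "planting_date"
    value: Optional[str]            # None while unresolved
    status: str                     # "extracted" | "inferred" | "unresolved"
    confidence: float = 0.0         # model confidence in [0, 1]
    evidence: Optional[str] = None  # e.g. "p. 3, Table 2, row 'Sowing'"

def validate_field(f: IRField) -> list[str]:
    """Schema-level checks that run before BETYdb materialization."""
    errors = []
    if f.status not in {"extracted", "inferred", "unresolved"}:
        errors.append(f"{f.name}: unknown status {f.status!r}")
    if f.status != "unresolved" and f.value is None:
        errors.append(f"{f.name}: resolved field has no value")
    if f.status == "extracted" and f.evidence is None:
        errors.append(f"{f.name}: extracted field must cite evidence")
    if not 0.0 <= f.confidence <= 1.0:
        errors.append(f"{f.name}: confidence out of range")
    return errors
```

Keeping validation separate from extraction in this way is what allows the extraction backend (agentic workflow, fine-tuned model, or hybrid) to be swapped without touching the audit logic.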

**Expected outcomes:**

A successful project would complete the following tasks:

* IR schema definition with validation rules and documented field semantics covering sites, treatments, managements, and traits/yields
* Modular extraction pipeline for document parsing, information extraction, and IR generation with clear separation between extraction and validation logic
* Independent validators for BETYdb semantics, unit consistency, temporal logic, and required fields
* BETYdb export module producing upload-ready management CSVs and bulk trait upload formats with full provenance preservation
* Scientist-in-the-loop review interface for approving, correcting, or rejecting extracted entries with inline evidence and confidence scores
* Evaluation harness with automated metrics for extraction accuracy, inference quality, coverage, and time savings on held-out test papers
* Documentation covering IR schema specification, developer guidance for adding new extraction components, and user guidance for the review interface
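One of the independent validators listed above might look like the following sketch. The record layout, the `EXPECTED_UNITS` table, and the field names are illustrative assumptions for this example, not the actual BETYdb schema.

```python
# Illustrative sketch of an independent validator for management records.
# The unit whitelist and record keys are assumptions, not BETYdb's schema.
EXPECTED_UNITS = {
    "fertilizer_N": {"kg ha-1", "g m-2"},
    "seeding_rate": {"seeds m-2", "kg ha-1"},
}

def check_management(rec: dict) -> list[str]:
    """Unit-consistency and temporal-logic checks, independent of extraction."""
    problems = []
    mgmt_type = rec.get("mgmttype")
    allowed = EXPECTED_UNITS.get(mgmt_type)
    if allowed is not None and rec.get("units") not in allowed:
        problems.append(f"{mgmt_type}: unexpected units {rec.get('units')!r}")
    # Temporal logic: a management event cannot precede planting
    # in the same season (ISO dates compare correctly as strings).
    planting, date = rec.get("planting_date"), rec.get("date")
    if planting and date and date < planting:
        problems.append(f"{mgmt_type}: dated {date}, before planting {planting}")
    return problems
```

Because each validator only reads the IR, new checks (e.g. for required fields or trait ranges) can be added without modifying the extraction pipeline, and the review interface can surface each failed check next to its evidence link.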

**Prerequisites:**

- Required: R Shiny, Python; familiarity with scientific literature and experimental design concepts
- Helpful: experience with LLM APIs (Anthropic, OpenAI) or fine-tuning frameworks, knowledge of BETYdb schema and workflows, familiarity with agronomic or ecological experimental designs

**Contact person:**

Nihar Sanda (@koolgax99), David LeBauer (@dlebauer)

**Duration:**

Large (350 hr)

**Difficulty:**

Medium to High

<!--


# This comment section for ideas that may be potentially viable in future (with revision)


---

### 4. Development of Notebook-based PEcAn Workflows{#notebook}

The PEcAn workflow is currently run using either a web-based user interface, an API, or custom R scripts. The web-based user interface is easiest to use but has limited functionality, whereas the custom R scripts and API are more flexible but require more experience.

This project will focus on building Quarto notebooks that provide an interface to PEcAn that is both welcoming to new users and flexible enough to be a starting point for more advanced users. It will build on existing [Pull Request 1733](https://github.com/PecanProject/pecan/pull/1733).

**Expected outcomes:**

- Two or more template workflows for running the PEcAn workflow.
- Written vignette and video tutorial introducing their use.

**Prerequisites:**

- Familiarity with R.
- Familiarity with RStudio and Quarto or Rmarkdown is a plus.

**Contact person:**
David LeBauer @dlebauer, Nihar Sanda @koolgax99

**Duration:**
Medium (175hr)

**Difficulty:**
Medium


#### BETYdb R data package

BETYdb's web front end is built on a version of Ruby on Rails that is functional but no longer supported. A key feature of BETYdb is that the data is open and accessible.

Building an R data package would make the Trait and Yield data currently in BETYdb more accessible to users beyond the PEcAn community.

**Expected outcomes:**

A successful project would complete a subset of the following tasks:

- An R package containing the data currently hosted in BETYdb.
- Documentation and examples of use.
- Updates to BETYdb documentation.

**Prerequisites:**

- Required: R
- Helpful: R package development; familiarity with relational databases and SQL.

**Contact person:**

David LeBauer (@dlebauer)

**Duration:**

Medium (175hr) to Large (350hr) depending on scope of proposal.

**Difficulty:**

Medium

---

#### [Optimize PEcAn for freestanding use of single packages [R package development]](#freestanding)

PEcAn was designed as a system of independent modules, each implemented as its own R package intended to be usable either standalone or as part of the full PEcAn system. Subsequent development, focused on the most common cross-module workflows, has led to tighter coupling between modules than was originally intended, so that in practice many of the modules are now challenging to use, test, or develop without a full understanding of their interdependencies. Further, some packages expect inputs and outputs in data structures that are only generated by other PEcAn packages but might be more easily provided in standard, well-known formats. We seek proposals to loosen these couplings by revisiting the design and interfaces of PEcAn packages through one or more of:

1. Refactoring code to remove unneeded dependencies, simplify package interfaces, and exchange data in standard formats
2. Identifying exported functions that are not core to the functionality of the package, and removing them or making them internal
3. Writing tests and examples that demonstrate freestanding use of the package
4. Developing methods for tracking the dependencies between packages that cannot be eliminated, including how these change between package versions

Proposals for this project should choose a subset of these approaches and apply them to a specified subset of the PEcAn packages. Strong proposals will clearly show why each chosen package should be a priority, how it will become more independent at the completion of the project, and what interface changes will be needed to achieve this.

**Expected outcome:**

- One or more PEcAn packages can be installed, used, and/or tested without the user needing to know [something previously important] about [another package].

**Prerequisites:**

- Familiarity with R, especially how it manages dependencies between packages, and with concepts of software package development. Helpful resources: [rOpenSci packages](https://devguide.ropensci.org/index.html) and [R packages](https://r-pkgs.org). Experience with multi-package code bases will be very helpful.

**Contact person:**
Chris Black @infotroph, Shashank Singh @moki1202

**Duration:**
Flexible to work as either a Medium (175hr) or Large (350 hr)

**Difficulty:**
Medium to Large

---

#### [PEcAn model coupling and development [Data Science]](#coupling)

PEcAn has the capability to interface multiple ecological models. The goal of this project is to improve the coupling of existing models to PEcAn (specifically FATES) and add new models (specifically a simple vegetation model that is under development). It is also possible to contribute to the development of the simple vegetation model which is written in Fortran.

**Expected outcome:**

- New or improved PEcAn model packages.

**Prerequisites:**

- R, Fortran is an advantage.

**Contact person:**
Hui Tang @Hui Tang, Istem Fer @istfer

**Duration:**
Flexible to work as either a Medium (175hr) or Large (350 hr)

**Difficulty:**
Medium

---
-->