Projects
Work Projects
LinuxBench Development for Redwood Research (Jun 4, 2025 – Aug 12, 2025; ongoing)
Developer on a Docker-based agent evaluation platform from its inception. 12 PRs merged covering setting development, task ideation, solution and scorer implementation, and research tooling for the evaluation suite. Heavily reworked a setting to add complexity; implemented 2 main and 2 side tasks; provided baseline solutions for 2 task-combination variants.
Technologies/Concepts: Python, Docker, agent evaluation, task development, platform development
ControlArena Development for AISI (Mar 14, 2025 – Jun 11, 2025)
Worked on a Kubernetes-based agent evaluation platform for the UK Government. 13 PRs merged across control-arena and ca-k8s-infra, spanning 2 tasks and covering task ideation, solution implementation, and scorer implementation; also contributed research on task ideation.
Technologies/Concepts: Python, Kubernetes, agent evaluation, task development, infrastructure
Contractor Research Engineer for Equistamp (2025–Current)
Transitioned from METR as a contractor, continuing LLM evaluation work. Labeled video data of human experts resolving software development tasks for METR, and contributed to developing and testing the platform migration from Vivaria to an Inspect-based platform.
Technologies/Concepts: Python, video data labeling, platform migration, LLM evaluation
Model Evaluation and Task Engineer for METR (2024–2025)
Worked on LLM agent evaluation and task development. Labeled and graded LLM agent runs on several task suites, performed human expert baselines for tasks, and contributed to task development and fixes. Acknowledged in the HCAST paper for contributions to human-calibrated autonomy software tasks.
Technologies/Concepts: Python, LLM evaluation, data labeling, task development, human baselines
Backend developer for IBM Zurich
Contract developer from Softinsa. Developed internal backend systems and open-source utility functions on a large project with a sizable team. All disclosable information can be seen on the public page and in some of the code.
Technologies/Concepts: Python, python-fastAPI, python-pydantic, python-typer, git, Poetry, Kubernetes, OpenShift
Developer for Softinsa
Built an automated platform for generating, grading, and forwarding knowledge tests, which Softinsa uses internally to assess new employee candidates. Administrators can:
- Generate a test containing any number of questions on a given topic.
- Manage the questions and topics stored in the database.
- Forward tests to candidates via URL.
- Manage existing tests and see their results.
- See a dashboard with several KPIs, such as question counts per topic, completion status of generated tests, and grades of completed tests.
Users can:
- Access a single-attempt, timed test generated by an administrator, answer the multiple-choice questions, and submit the test for automatic evaluation.
Technologies/Concepts: Python, python-flask, MongoDB, Docker, git
Backend developer for IBM Munich
Sole backend developer of a Python project, with occasional frontend tasks. The system received pictures from a photobooth, sent them to several inference models hosted on a remote server, and applied some data processing itself. It then submitted the results to an MSSQL database and an SAP endpoint. It was an interesting project developed by a small team with clearly defined roles. Cannot disclose more information due to NDA.
Technologies/Concepts: Python, MSSQL, Angular, Linux scripting, automated log analysis.
A tool for predictive analysis of trends in an IT Service
A proof-of-concept web app that drew its data from a database containing all issue tickets. It charted the issues by category and offered a predictive analysis of the future number of issues using Facebook Prophet.
Technologies/Concepts: Python, Facebook Prophet, SQL, Flask.
Degree Projects
Database Development and Database Interface Application Development for a mock-up of a complex hospital workflow
Interface applications built in JavaFX; made use of transaction isolation levels to avoid deadlocks and massively increase performance.
Technologies/Concepts: Java, SQL, JavaFX.
Android application for BMI tracking and assistance with BMI improvement
Technologies/Concepts: Java, Android, SQLite.
Android application to guide visitors through “Centro de Energia Viva da Montanha”
Includes a Facebook-like news feed and QR code reading to provide visitors with information about the museum's exhibitions.
Technologies/Concepts: Java, Android, SQL.
Something Brawl: Prototype of a multiplayer, cross-platform, turn-based card game and its authoritative server.
