A Java-based application that uses LLMs (via LangChain4j and Ollama) to analyze data files (.csv and .tab) and identify variables containing location and time/date information.
The application scans a specified directory (recursively), reads each data file, and uses an LLM to determine if any of the columns represent:
- Location: Cities, countries, coordinates, addresses, latitude, longitude, etc.
- Time/Date: Years, months, timestamps, dates, durations, etc.
It outputs whether each file meets the requirements (contains both a location and a time variable).
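The requirement above reduces to a simple rule: a file qualifies when at least one column is classified as a location and at least one as a time/date. A minimal sketch of that rule follows; the ComplianceCheck class and ColumnResult record are hypothetical names used for illustration, and the project's own AnalyzerService may model results differently.

```java
import java.util.List;

final class ComplianceCheck {
    /** Hypothetical per-column result; not taken from the project's source. */
    record ColumnResult(String label, boolean isLocation, boolean isTime) {}

    /** True when at least one column is a location and at least one is a time/date. */
    static boolean meetsRequirements(List<ColumnResult> columns) {
        boolean hasLocation = columns.stream().anyMatch(ColumnResult::isLocation);
        boolean hasTime = columns.stream().anyMatch(ColumnResult::isTime);
        return hasLocation && hasTime;
    }

    public static void main(String[] args) {
        List<ColumnResult> columns = List.of(
                new ColumnResult("country", true, false),
                new ColumnResult("year", false, true),
                new ColumnResult("gdp", false, false));
        System.out.println(meetsRequirements(columns)); // prints "true"
    }
}
```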
- Java 17 or higher.
- Maven for building the project.
- Ollama installed and running locally (or accessible via network).
  - Ensure you have the llama3.2 model (or your preferred model) pulled in Ollama: ollama pull llama3.2
The application is configured via src/main/resources/application.properties:
- ollama.url: The base URL for the Ollama API (default: http://localhost:11434).
- ollama.model: The LLM model to use (default: llama3.2).
- analyzer.search-root: The root directory to scan for data files (default: data).
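As an illustration of how these three keys and their defaults could be read, here is a minimal sketch using java.util.Properties. The AnalyzerConfig class is a hypothetical name; the project may load its configuration through a different mechanism.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class AnalyzerConfig {
    final String ollamaUrl;
    final String ollamaModel;
    final String searchRoot;

    /** Falls back to the documented defaults when a key is absent. */
    AnalyzerConfig(Properties props) {
        this.ollamaUrl = props.getProperty("ollama.url", "http://localhost:11434");
        this.ollamaModel = props.getProperty("ollama.model", "llama3.2");
        this.searchRoot = props.getProperty("analyzer.search-root", "data");
    }

    /** Loads application.properties from the classpath, if present. */
    static AnalyzerConfig load() throws IOException {
        Properties props = new Properties();
        try (InputStream in = AnalyzerConfig.class.getClassLoader()
                .getResourceAsStream("application.properties")) {
            if (in != null) {
                props.load(in);
            }
        }
        return new AnalyzerConfig(props);
    }

    public static void main(String[] args) throws IOException {
        AnalyzerConfig config = load();
        System.out.println(config.ollamaUrl + " " + config.ollamaModel + " " + config.searchRoot);
    }
}
```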
mvn clean compile
mvn exec:java
Alternatively, if you've already compiled:
mvn exec:java -Dexec.mainClass="edu.harvard.iq.datacommons.analyzer.Application"
- Scanning: The AnalyzerService walks the directory tree starting from analyzer.search-root.
- Parsing: For each .csv or .tab file, it reads the header and the first 5 rows of data.
- LLM Analysis: For each column, it sends a prompt to the Ollama model (using LangChain4j) containing the column label and sample values.
- Classification: The LLM responds with YES or NO to classify if the column represents a location or time/date.
- Results: The application prints the analysis results for each file to the console.
- Copying: If a file is identified as having both a location and a time/date variable, it is copied to a new directory named DataCommonsReady-<timestamp> (e.g. DataCommonsReady-20240315-103000).
  - The original directory structure is preserved within this directory.
  - For example, if data/subdir/file.csv meets the requirements, it will be copied to DataCommonsReady-<timestamp>/subdir/file.csv.
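The scanning, parsing, and classification steps above can be sketched roughly as follows. This is not the project's AnalyzerService: the class name, method names, and prompt wording are hypothetical, the delimiter handling is a naive split that ignores quoted fields, .tab files are assumed to be tab-delimited, and the LLM is represented by a plain Function<String, String> so the example runs without Ollama or LangChain4j.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Stream;

final class ColumnScanSketch {
    static final int SAMPLE_ROWS = 5;

    /** Recursively collects .csv and .tab files under root. */
    static List<Path> findDataFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase();
                        return name.endsWith(".csv") || name.endsWith(".tab");
                    })
                    .sorted()
                    .toList();
        }
    }

    /** Reads the header plus up to SAMPLE_ROWS data rows (naive split, no quoting support). */
    static List<String[]> readHeaderAndSample(Path file) throws IOException {
        boolean isTab = file.getFileName().toString().toLowerCase().endsWith(".tab");
        String delimiter = isTab ? "\t" : ",";
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while (rows.size() < SAMPLE_ROWS + 1 && (line = reader.readLine()) != null) {
                rows.add(line.split(delimiter, -1));
            }
        }
        return rows;
    }

    /** Sample values of one column, skipping the header row. Expects a non-empty row list. */
    static List<String> columnSamples(List<String[]> rows, int col) {
        List<String> samples = new ArrayList<>();
        for (String[] row : rows.subList(1, rows.size())) {
            if (col < row.length) {
                samples.add(row[col]);
            }
        }
        return samples;
    }

    /** Illustrative prompt wording; the project's actual prompt is not shown in this README. */
    static String buildPrompt(String kind, String label, List<String> samples) {
        return "Does the column '" + label + "' with sample values " + samples
                + " represent " + kind + " information? Answer YES or NO.";
    }

    /** Interprets the model reply leniently: leading whitespace and case are ignored. */
    static boolean isYes(String reply) {
        return reply != null && reply.trim().toUpperCase().startsWith("YES");
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("analyzer-demo");
        Files.writeString(root.resolve("sample.csv"), "city,year\nBoston,2020\nParis,2021\n");

        // Stand-in for the LLM call; a real run would send the prompt to Ollama instead.
        Function<String, String> fakeModel =
                prompt -> prompt.contains("'year'") && prompt.contains("time/date") ? "YES" : "NO";

        for (Path file : findDataFiles(root)) {
            List<String[]> rows = readHeaderAndSample(file);
            if (rows.isEmpty()) {
                continue; // empty file, nothing to classify
            }
            String[] header = rows.get(0);
            for (int i = 0; i < header.length; i++) {
                String prompt = buildPrompt("time/date", header[i], columnSamples(rows, i));
                System.out.println(header[i] + " time/date=" + isYes(fakeModel.apply(prompt)));
            }
        }
    }
}
```

Keeping the model behind a Function makes the parsing and prompt logic testable offline; swapping in a real LangChain4j model call would only change how that function is constructed.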
The DataCommonsReady-<timestamp> directory will contain all files that are deemed "compliant" (containing both Location and Time data). This is useful for downstream processing that requires these specific dimensions.
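To illustrate how the directory structure can be preserved during the copy, here is a minimal sketch based on Path.relativize. The CompliantFileCopier class and its method names are hypothetical, and the yyyyMMdd-HHmmss timestamp pattern is inferred from the example name DataCommonsReady-20240315-103000 rather than confirmed from the source code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class CompliantFileCopier {
    // Assumed pattern, inferred from the example directory name.
    static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    /** Builds the output directory name for the given moment. */
    static Path outputDir(LocalDateTime now) {
        return Path.of("DataCommonsReady-" + now.format(STAMP));
    }

    /**
     * Copies file into outputDir at the same relative location it has under searchRoot.
     * searchRoot and file must both be absolute or both be relative.
     */
    static Path copyPreservingStructure(Path searchRoot, Path file, Path outputDir)
            throws IOException {
        Path target = outputDir.resolve(searchRoot.relativize(file));
        Files.createDirectories(target.getParent());
        return Files.copy(file, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("data");
        Path source = root.resolve("subdir").resolve("file.csv");
        Files.createDirectories(source.getParent());
        Files.writeString(source, "city,year\nBoston,2020\n");

        Path out = Files.createTempDirectory("out").resolve(outputDir(LocalDateTime.now()));
        System.out.println(copyPreservingStructure(root, source, out));
    }
}
```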
- src/main/java/edu/harvard/iq/datacommons/analyzer/Application.java: Main entry point.
- src/main/java/edu/harvard/iq/datacommons/analyzer/AnalyzerService.java: Core analysis logic.
- src/main/resources/application.properties: Configuration settings.
- pom.xml: Maven dependencies and build configuration.
- data/: Sample data directory.