A comprehensive platform for extracting, processing, and analyzing German commercial register data, with a focus on angel investor networks and startup ecosystems.
## Overview

This platform consists of multiple microservices that work together to:
- Extract company data from German commercial registers
- Process and standardize extracted information using LLMs
- Analyze angel investor networks and startup ecosystems
- Generate insights through data visualization and network analysis
## Project Structure

```
cr_exploration/
├── cr_extraction/                      # Commercial register data extraction service
├── llm_data_standardization/           # LLM-powered data standardization
├── structured_info/                    # Structured information extraction
├── structured_info_for_shareholders/   # Shareholder-specific data processing
├── table_extraction/                   # PDF table extraction service
├── notebooks/                          # Jupyter notebooks for data analysis
├── tests/                              # Test suite
└── template.env                        # Environment variables template
```
## Services

### cr_extraction

**Purpose**: Extract company and shareholder data from German commercial registers

**Key Functions**:
- `search_companies()`: Search companies by name
- `search_companies_by_id()`: Search companies by ID
- `download_files()`: Download commercial register documents
- `get_shareholder_structured_info()`: Extract structured shareholder information

**Dependencies**: Flask, Google Cloud Storage, MechanicalSoup, BeautifulSoup
### llm_data_standardization

**Purpose**: Standardize extracted data using OpenAI's language models

**Key Functions**:
- `standardize_data()`: Process and standardize company data

**Dependencies**: OpenAI, Pandas, Flask
### structured_info

**Purpose**: Extract structured information from commercial register documents

**Key Functions**:
- `get_structured_content()`: Extract structured content from documents

**Dependencies**: OpenAI, Pandas, Flask
### table_extraction

**Purpose**: Extract tabular data from PDF documents

**Key Functions**:
- `extract_table()`: Extract tables from PDF files

**Dependencies**: OpenAI, Pandas, Flask
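The usage examples below do not cover this service, so here is a hedged sketch of what a call might look like. The endpoint name matches `extract_table`, but the payload fields (`file_url`, `pages`) are assumptions about the service contract, not confirmed by the source:

```python
import json

import requests

# Build the request without sending it, so the payload can be inspected;
# `file_url` and `pages` are hypothetical fields -- check the service contract.
req = requests.Request(
    "POST",
    "https://your-function-url/extract_table",
    json={"file_url": "gs://your-bucket/shareholder_list.pdf", "pages": [1]},
).prepare()

payload = json.loads(req.body)
# To actually send it: tables = requests.Session().send(req).json()
```

Separating request construction from sending makes the payload easy to inspect and unit-test before hitting a deployed function.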
## Data Analysis

The `notebooks/` directory contains comprehensive analysis of angel investor networks:
- **Network Analysis**: Angel investor co-investment networks
- **Descriptive Statistics**: Analysis of angels, startups, and network characteristics
- **Community Detection**: Identification of investor communities
- **Geographic Analysis**: Regional investment patterns
- **Industry Analysis**: Investment patterns across sectors
Key datasets:

- `angels.csv`: Angel investor profiles and characteristics
- `startups.csv`: Startup company information
- `shareholder_relations_angel.csv`: Angel-startup investment relationships
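As a minimal sketch of the co-investment projection behind the network analysis: the column names `angel_id` and `startup_id` are assumptions about the CSV schema, and the inline DataFrame stands in for `shareholder_relations_angel.csv`:

```python
import itertools

import networkx as nx
import pandas as pd
from networkx.algorithms.community import greedy_modularity_communities

# Stand-in for shareholder_relations_angel.csv; the column names
# `angel_id` and `startup_id` are assumed, not taken from the real schema.
relations = pd.DataFrame({
    "angel_id":   ["a1", "a1", "a2", "a2", "a3"],
    "startup_id": ["s1", "s2", "s1", "s2", "s3"],
})

# Project the bipartite angel-startup relation onto angels: two angels are
# connected if they backed the same startup, weighted by co-investment count.
G = nx.Graph()
for _, group in relations.groupby("startup_id"):
    for a, b in itertools.combinations(sorted(set(group["angel_id"])), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

# Community detection on the weighted projection
communities = greedy_modularity_communities(G, weight="weight")
```

In the real notebooks, `pd.read_csv("shareholder_relations_angel.csv")` would replace the inline frame, and angel attributes from `angels.csv` can be attached with `nx.set_node_attributes` for the descriptive and geographic analyses.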
## Getting Started

### Prerequisites

- Python 3.8+
- Google Cloud Platform account
- OpenAI API key
- Azure Form Recognizer endpoint and key (see `template.env`)
### Setup

1. Copy `template.env` to `.env`.
2. Configure your environment variables:

   ```
   GOOGLE_APPLICATION_CREDENTIALS=Path/To/Google/Key.json
   OPENAI_API_KEY=your-openai-api-key
   ENV=prod
   FORM_RECOGNIZER_ENDPOINT=your-form-recognizer-endpoint
   FORM_RECOGNIZER_KEY=your-form-recognizer-key
   ```

3. Install dependencies for each service:
```bash
cd cr_extraction && pip install -r requirements.txt
cd ../llm_data_standardization && pip install -r requirements.txt
cd ../structured_info && pip install -r requirements.txt
cd ../table_extraction && pip install -r requirements.txt
```

## Deployment

Each service is designed to deploy as a Google Cloud Function:
```bash
# Deploy commercial register extraction
gcloud functions deploy search_companies \
    --runtime python39 \
    --trigger-http \
    --source cr_extraction/

# Deploy data standardization
gcloud functions deploy standardize_data \
    --runtime python39 \
    --trigger-http \
    --source llm_data_standardization/
```

## Local Development

Run services locally using the Functions Framework:
```bash
cd cr_extraction && functions-framework --target search_companies --debug
cd ../llm_data_standardization && functions-framework --target standardize_data --debug
```

## Usage

```python
import requests

# Search for a company
response = requests.post('your-function-url/search_companies',
                         json={'name': 'Company Name'})
companies = response.json()

# Download commercial register documents
response = requests.post('your-function-url/download_files',
                         json={
                             'company_id': 'company_id',
                             'documents': ['extract', 'shareholder_list']
                         })

# Standardize extracted data
response = requests.post('your-function-url/standardize_data',
                         json={'company_id': 'company_id'})
standardized_data = response.json()
```

## Testing

Run the test suite:
```bash
python -m pytest tests/
```

## Workflow

1. **Data Extraction**: Use `cr_extraction` to gather company data
2. **Data Processing**: Use `llm_data_standardization` to clean and standardize data
3. **Information Extraction**: Use `structured_info` and `table_extraction` for detailed analysis
4. **Network Analysis**: Use the notebooks for angel investor network analysis
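Taken together, this workflow could be driven by a small script. The following is a sketch only: the endpoint names follow the usage examples, the `id` field in the search response is an assumption about the service contract, and the `post` callable is injectable so the flow can be exercised without deployed functions:

```python
import requests

def run_pipeline(base_url, company_name, post=requests.post):
    """Sketch of the extraction -> standardization flow.

    Endpoint names mirror the usage examples; the response shape
    (a list of companies carrying an "id" field) is an assumption.
    """
    # 1. Data extraction: find the company
    companies = post(f"{base_url}/search_companies",
                     json={"name": company_name}).json()
    company_id = companies[0]["id"]  # assumed response field

    # 2. Download the register documents
    post(f"{base_url}/download_files",
         json={"company_id": company_id,
               "documents": ["extract", "shareholder_list"]})

    # 3. Data processing: standardize the extracted data
    return post(f"{base_url}/standardize_data",
                json={"company_id": company_id}).json()
```

Injecting `post` keeps the control flow testable: a stub returning canned responses can verify the call sequence without network access.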
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request
## License

This project is proprietary and confidential.
## Tech Stack

- **Web Scraping**: MechanicalSoup, BeautifulSoup
- **Cloud Services**: Google Cloud Storage, Google Cloud Functions
- **AI/ML**: OpenAI API
- **Data Processing**: Pandas, NumPy
- **Network Analysis**: NetworkX
- **Visualization**: Matplotlib, Plotly
Note: This platform is designed for research and analysis of German commercial register data. Ensure compliance with data protection regulations when using this tool.