Dataset Integration

Last generated: 2026-03-12 16:05

Source: datasets.md

Working with Research Datasets

Paramus provides direct access to curated chemical and materials science datasets. You can install, query, and cross-reference datasets without leaving your research environment.

Available Research Domains

Domain	Datasets	What You Get
Polymer Science	RadonPy, PI1M, OpenMacromolecularGenome, VipEA, OMG-Property-Database, PolyIE	~1M+ polymer structures with physical properties from MD simulations
Computational Chemistry	QM9, QM9S, MSR-ACC-TAE25	134k small molecules with DFT-level energies, HOMO/LUMO, dipole moments
Inorganic / Crystallography	COD, a-Si-24, Anionic-Solvation-Dataset	Crystal structures, amorphous silicon configurations, solvation data
Organic / Solubility	BigSolDB	112,465 experimental solubility records across multiple solvents

Installing a Dataset

Select a dataset tile and click Install. Paramus downloads the data files from their source (Zenodo, GitHub) and prepares them for querying. Original files are never modified — normalized copies and a search index are created alongside them.

Querying by Chemical Properties

Ask questions in natural language through the chat. Paramus translates your request into the right query automatically.

Find soluble compounds in ethanol at room temperature:

“Show me compounds with LogS above -2 in ethanol between 20 and 30 degrees Celsius from BigSolDB”

Screen polymers by glass transition temperature:

“Which polymers in RadonPy have a Tg above 400K and density below 1.2 g/cm3?”

Look up molecular properties by structure:

“Get the HOMO-LUMO gap and dipole moment for all molecules containing a carbonyl group in QM9”

SMILES columns are automatically canonicalized using RDKit, so c1ccccc1 and C1=CC=CC=C1 both find benzene.

Query Methods

Method	Use Case
`dataset.query`	Filter by structure, property ranges, solvents, conditions
`dataset.query_schema`	Inspect available columns, types, and value ranges
`dataset.query_remote`	Query a dataset without downloading it first
`dataset.list`	See all installed datasets
`dataset.get`	Get metadata and file listing for a dataset

Supported File Formats

Paramus handles common research data formats out of the box:

Format	Extensions
Tabular	.csv, .json, .jsonl, .xlsx, .xls, .parquet, .feather
Scientific	.h5, .hdf5, .mat, .npy, .npz
Serialized	.pkl, .pickle
Archives	.tar, .tar.gz, .tar.bz2, .zip (auto-extracted)

Use dataset.unfold to convert between formats (e.g. Parquet to CSV).

Semantic Knowledge Graphs

Beyond tabular datasets, three RDF knowledge graphs capture domain-specific research context:

Knowledge Graph	Focus
Polymer Chemistry R&D	Polymer synthesis, characterization, and property prediction
Medicinal Chemistry (Molidustat)	HIF-PHD inhibitor research, SAR relationships
Germanium Extraction R&D	Hydrometallurgical processing, extraction optimization

These are managed separately via semantic.list, semantic.switch, and semantic.info.

Dataset Metadata

Each dataset card follows the Croissant 1.0 + Schema.org standard, capturing provenance, licensing, and citation:

{
  "@type": "Dataset",
  "name": "BigSolDB",
  "dataOrigin": "experimental",
  "measurementTechnique": "Various experimental methods",
  "license": "CC-BY-4.0",
  "citation": {
    "name": "BigSolDB: Solubility Dataset of Compounds in Organic Solvents",
    "identifier": "10.1038/s41597-023-02..."
  }
}

This ensures every query result can be traced back to its original publication and data source.