qa.process

Process the raw TeraChem output for future analysis.

Module Contents

Functions

get_pdb(→ str)

Searches all directories recursively for a PDB file.

get_xyz(→ str)

Searches all directories for an XYZ file.

get_atom_count(→ int)

Finds an xyz file and gets the number of atoms.

combine_xyzs(→ None)

Combine an arbitrary number of xyz files.

get_protein_sequence(→ List[str])

Gets the full amino acid sequence of your protein.

get_charge_file(→ str)

Searches all directories for a charge xls file.

combine_sp_xyz()

Combines single point xyz's for all replicates.

combine_restarts_old(→ None)

Collects all charges or coordinates into single xls and xyz files.

combine_restarts_new(→ None)

Collects all charges or coordinates into single xls and xyz files.

combine_replicates(→ None)

Collects charges or coordinates into an xls and xyz file across replicates.

summed_residue_charge(charge_data, template)

Sums the charges for all atoms by residue.

get_residue_identifiers(→ List[str])

Gets the residue identifiers such as Ala1 or Cys24.

xyz2pdb(→ None)

Converts an xyz file into a pdb file.

xyz2pdb_traj(→ None)

Converts an xyz trajectory file into a pdb trajectory file.

xyz2pdb_ensemble(→ None)

Converts an xyz trajectory file into a pdb ensemble.

clean_incomplete_xyz(→ None)

For removing incomplete frames during troubleshooting.

check_valid_resname(→ Tuple[str, int])

Checks if a valid resname has been identified.

get_res_atom_indices(→ List[int])

For a residue get the atom indices of all atoms in the residue.

clean_qm_jobs(→ None)

Cleans all QM jobs and checks for completion.

combine_qm_charges(→ None)

Combines the charge_mull.xls files generated by TeraChem single points.

combine_qm_replicates(→ None)

Combine the all_charges.xls files for replicates into a master charge file.

string_to_list(→ List[List[int]])

Converts a list of numerical strings to a list of lists of numbers.

simple_xyz_combine()

Takes all xyz molecular structure files in the current directory and combines them to create a single xyz trajectory.

qa.process.get_pdb() str

Searches all directories recursively for a PDB file.

If more than one PDB is found it will use the first one. If no PDB file was found, it will prompt the user for a PDB file path.

Returns:

pdb_file – The path of a PDB file within the current directory (recursive).

Return type:

str

Notes

Currently it uses the name to distinguish single structures, ensembles, and trajectories. In the future, this function should check the contents to confirm.
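
The recursive search described above could be sketched as follows. This is a minimal illustration, not the module's implementation: `find_first_pdb` is a hypothetical helper, and it returns None where get_pdb() would prompt the user, so that the example stays non-interactive.

```python
from pathlib import Path
from typing import Optional


def find_first_pdb(root: str = ".") -> Optional[str]:
    """Return the first .pdb file found under root, or None if there is none.

    Hypothetical helper: the documented get_pdb() prompts the user for a
    path instead of returning None.
    """
    # rglob walks the tree recursively; sorting makes "first" deterministic.
    matches = sorted(Path(root).rglob("*.pdb"))
    return str(matches[0]) if matches else None
```

Sorting the matches is a design choice made here so that repeated calls pick the same file; the order used by the real function is not stated in this reference.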

qa.process.get_xyz() str

Searches all directories for an XYZ file.

If more than one XYZ file is found, it will use the first one. If no XYZ file is found, it will prompt the user for the XYZ file path.

Returns:

xyz_name – The path of an XYZ file within the current directory.

Return type:

str

qa.process.get_atom_count() int

Finds an xyz file and gets the number of atoms.

Returns:

atom_count – The number of atoms in the identified xyz file.

Return type:

int
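
In the xyz format, the first line of a file holds the number of atoms. A minimal sketch of reading it is shown below; `read_atom_count` is a hypothetical name, and unlike get_atom_count() it takes the file path as an argument instead of searching for the file.

```python
def read_atom_count(xyz_path: str) -> int:
    """Return the atom count stored on the first line of an xyz file."""
    with open(xyz_path) as handle:
        # The first line of an xyz file is the number of atoms in a frame.
        return int(handle.readline().strip())
```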

qa.process.combine_xyzs() None

Combine an arbitrary number of xyz files.

When generating the input for the QM calculations, you may have created a directory of single xyz structures. This script will recombine them into a single xyz trajectory.

qa.process.get_protein_sequence(pdb_path) List[str]

Gets the full amino acid sequence of your protein.

See also

qa.plot.heatmap

qa.process.get_charge_file() str

Searches all directories for a charge xls file.

If more than one .xls file is found it will use the first one. If no .xls file was found it will prompt the user for the .xls file path. This is the standard charge output for TeraChem.

Returns:

charge_file – The path of a charge .xls file within the current directory.

Return type:

str

Notes

Starts the search in the top-level directory that contains all the other directories.

qa.process.combine_sp_xyz()

Combines single point xyz’s for all replicates.

Each QM single point has its own geometry file. This function combines all of those xyz files. This is preferable to using the other geometry files because it ensures that the geometries are identical.

Returns:

replicate_info – List of tuples with replicate number and frame count for the replicates.

Return type:

List[tuple]

qa.process.combine_restarts_old(atom_count, all_charges: str = 'all_charges.xls', all_coors: str = 'all_coors.xyz') None

Collects all charges or coordinates into single xls and xyz files.

Likely the first executed function after generating the raw AIMD data. Trajectories were likely generated over multiple runs. This function combines all coordinate and charge data for each run.

Parameters:
  • all_charges (str) – The name of the file containing all charges in xls format.

  • all_coors (str) – The name of the file containing the coordinates in xyz format.

  • atom_count (int) – The number of atoms in the structure.

Notes

Run from the directory that contains the run fragments.

See also

combine_replicates

Combines all the combined trajectories of each replicate into one.

qa.process.combine_restarts_new(atom_count, all_charges: str = 'all_charges.xls', all_coors: str = 'all_coors.xyz') None

Collects all charges or coordinates into single xls and xyz files.

This version determines overlaps by parsing the frame numbers in the title lines of coors.xyz and adjusts the corresponding lines in charges.xls.

Parameters:
  • all_charges (str) – The name of the file containing all charges in xls format.

  • all_coors (str) – The name of the file containing the coordinates in xyz format.

  • atom_count (int) – The number of atoms in the structure.

Notes

Run from the directory that contains the run fragments.
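
The overlap handling described for this version could be sketched roughly as follows. The sketch is an illustration under stated assumptions, not the module's code: `drop_overlapping_frames` is a hypothetical helper, the title-line format ("frame N ...") is assumed, and keeping the copy from the later restart is also an assumption. It covers only the coordinates; the matching adjustment of charges.xls is not shown.

```python
import re
from typing import Dict, List

# Assumed title-line format, e.g. "frame 12 xyz file generated by terachem".
FRAME_PATTERN = re.compile(r"frame\s+(\d+)")


def drop_overlapping_frames(lines: List[str], atom_count: int) -> List[str]:
    """Keep one copy of each frame, identified by the number in its title line."""
    block = atom_count + 2  # count line + title line + one line per atom
    frames: Dict[int, List[str]] = {}
    for start in range(0, len(lines) - block + 1, block):
        chunk = lines[start:start + block]
        match = FRAME_PATTERN.search(chunk[1])
        if match is None:
            raise ValueError(f"No frame number in title line: {chunk[1]!r}")
        # Assumption: a later restart overwrites an earlier copy of a frame.
        frames[int(match.group(1))] = chunk
    # Emit the frames in order of their frame number.
    return [line for number in sorted(frames) for line in frames[number]]
```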

qa.process.combine_replicates(all_charges: str = 'all_charges.xls', all_coors: str = 'all_coors.xyz') None

Collects charges or coordinates into an xls and xyz file across replicates.

Parameters:
  • all_charges (str) – The name of the file containing all charges in xls format.

  • all_coors (str) – The name of the file containing the coordinates in xyz format.

Notes

Run from the directory that contains the replicates. Run combine_restarts first if each replicate was run across multiple runs. Generalized to combine any number of replicates.

See also

combine_restarts

Combines restarts and should be run first.

qa.process.summed_residue_charge(charge_data: pandas.DataFrame, template: str)

Sums the charges for all atoms by residue.

Reduces inaccuracies introduced by the limitations of Mulliken charges.

Parameters:
  • charge_data (pd.DataFrame) – A DataFrame containing the charge data.

  • template (str) – The name of the template pdb for the protein of interest.

Returns:

sum_by_residues – The charge data summed by residue and stored as a pd.DataFrame.

Return type:

pd.DataFrame
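
The per-residue summation can be illustrated for a single frame with plain Python. This is only a sketch of the idea: the documented function works on a pandas DataFrame and a template PDB, whereas the hypothetical `sum_charges_by_residue` below takes a list of atomic charges and a matching list of residue identifiers.

```python
from collections import defaultdict
from typing import Dict, List


def sum_charges_by_residue(
    charges: List[float], residue_ids: List[str]
) -> Dict[str, float]:
    """Sum per-atom charges into per-residue totals.

    charges[i] is the charge of atom i; residue_ids[i] is that atom's residue.
    """
    if len(charges) != len(residue_ids):
        raise ValueError("Need exactly one residue identifier per atom.")
    totals: Dict[str, float] = defaultdict(float)
    for charge, residue in zip(charges, residue_ids):
        totals[residue] += charge
    return dict(totals)
```

For example, atoms with charges 0.25 and -0.5 that both belong to Ala1 contribute a single Ala1 entry of -0.25.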

qa.process.get_residue_identifiers(template, by_atom=True) List[str]

Gets the residue identifiers such as Ala1 or Cys24.

Returns the residue identifier for every atom if by_atom is True, or only the unique residue identifiers if by_atom is False.

Parameters:
  • template (str) – The name of the template pdb for the protein of interest.

  • by_atom (bool) – Whether to return a residue identifier for every atom (True) or only one per unique residue (False).

Returns:

residues_indentifier – A list of the residue identifiers

Return type:

List[str]
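
A minimal sketch of how such identifiers can be read from a PDB is given below. It relies on the fixed-column PDB layout (residue name in columns 18-20, residue number in columns 23-26). The name `residue_identifiers` is hypothetical, it takes the lines of the template instead of a file name, and collapsing consecutive repeats for by_atom=False is an assumption about what "unique" means here.

```python
from typing import List


def residue_identifiers(pdb_lines: List[str], by_atom: bool = True) -> List[str]:
    """Build identifiers such as Ala1 from the ATOM/HETATM records of a PDB."""
    identifiers: List[str] = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        res_name = line[17:20].strip().capitalize()  # columns 18-20
        res_num = int(line[22:26])                   # columns 23-26
        identifier = f"{res_name}{res_num}"
        # With by_atom=False, skip atoms that repeat the previous residue.
        if by_atom or not identifiers or identifiers[-1] != identifier:
            identifiers.append(identifier)
    return identifiers
```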

qa.process.xyz2pdb(xyz_list: List[str]) None

Converts an xyz file into a pdb file.

Parameters:

xyz_list (List[str]) – A list of the xyz file names that you would like to convert to PDBs.

Note

Make sure to manually check the PDB that is read in. Assumes no header lines. Assumes that the only TER flag is at the end.
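
The core of a template-based conversion can be sketched as follows: each ATOM or HETATM record of the template keeps its labels, and only the coordinate fields (columns 31-54, three fields of width 8) are replaced. This is an illustration under assumptions, not the module's implementation; `apply_xyz_to_template` is a hypothetical helper that works on lines already read into memory and assumes well-formed, fixed-width records.

```python
from typing import List, Tuple


def apply_xyz_to_template(
    template_lines: List[str], coords: List[Tuple[float, float, float]]
) -> List[str]:
    """Copy the template, replacing columns 31-54 of each atom record."""
    atom_rows = [l for l in template_lines if l.startswith(("ATOM", "HETATM"))]
    if len(atom_rows) != len(coords):
        raise ValueError("Template and xyz frame have different atom counts.")
    output: List[str] = []
    remaining = iter(coords)
    for line in template_lines:
        if line.startswith(("ATOM", "HETATM")):
            x, y, z = next(remaining)
            # Keep columns 1-30 and everything after column 54 unchanged.
            line = f"{line[:30]}{x:8.3f}{y:8.3f}{z:8.3f}{line[54:]}"
        output.append(line)
    return output
```

Non-atom lines such as the final TER record pass through untouched, which matches the note above that the only TER flag is expected at the end.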

qa.process.xyz2pdb_traj(xyz_name, pdb_name, pdb_template) None

Converts an xyz trajectory file into a pdb trajectory file.

Note

Make sure to manually check the PDB that is read in. Assumes no header lines. Assumes that the only TER flag is at the end.

qa.process.xyz2pdb_ensemble() None

Converts an xyz trajectory file into a pdb ensemble.

Note

Assumes that the only TER flag is at the end.

qa.process.clean_incomplete_xyz() None

For removing incomplete frames during troubleshooting.

This is currently under construction, and its use cases are not yet settled.

qa.process.check_valid_resname(res) Tuple[str, int]

Checks if a valid resname has been identified.

Expects a resname of the form e.g. Ala1, A1, Gly12, G12. If an incorrect resname is supplied, the function will exit with a warning.

Parameters:

res (str) – Name of a residue of the form e.g. Ala1, Gly12.

Returns:

  • aa_name (str) – The requested amino acid’s three letter code.

  • aa_num (int) – The requested amino acid’s position in the sequence.
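
The accepted forms (Ala1, A1, Gly12, G12) could be validated with a sketch like the one below. It is not the module's implementation: `parse_resname` is a hypothetical name, only the 20 standard amino acids are included, and it raises ValueError where the documented function exits with a warning.

```python
import re
from typing import Tuple

# One-letter to three-letter codes for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
THREE_LETTER = set(ONE_TO_THREE.values())


def parse_resname(res: str) -> Tuple[str, int]:
    """Split e.g. 'Ala1' or 'G12' into a three-letter code and a position."""
    match = re.fullmatch(r"([A-Za-z]|[A-Za-z]{3})(\d+)", res)
    if match is None:
        raise ValueError(f"Invalid residue name: {res!r}")
    name, number = match.group(1), int(match.group(2))
    if len(name) == 1:
        code = ONE_TO_THREE.get(name.upper())
    else:
        code = name.capitalize() if name.capitalize() in THREE_LETTER else None
    if code is None:
        raise ValueError(f"Unknown amino acid in residue name: {res!r}")
    return code, number
```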

qa.process.get_res_atom_indices(res, scheme='all') List[int]

For a residue get the atom indices of all atoms in the residue.

Parameters:
  • res (str) – Name of a residue of the form e.g. Ala1, Gly12.

  • scheme (str) – The type of atom indices to retrieve, e.g., all, backbone.

Returns:

residue_indices – A list of all atom indices for a given residue.

Return type:

list

qa.process.clean_qm_jobs(first_job: int, last_job: int, step: int) None

Cleans all QM jobs and checks for completion.

We ran single points at a higher level of theory from the SQM simulations. Some jobs will inevitably die due to memory or convergence issues, so it is important to check that all the jobs finished successfully. This script checks that all the jobs finished. Once it has confirmed that all jobs finished successfully, it will clean up the QM jobs by deleting log files and scratch directories.

Parameters:
  • first_job (int) – The name of the first directory and first job e.g., 0

  • last_job (int) – The name of the last directory and last job e.g., 39900

  • step (int) – The step size between each single point.

qa.process.combine_qm_charges(first_job: int, last_job: int, step: int) None

Combines the charge_mull.xls files generated by TeraChem single points.

After running periodic single points on the ab-initio MD data, we need to process the charge data so that it matches the SQM data. This code gets the charges from each single point and combines them. Results are stored in a tabular form.

Parameters:
  • first_job (int) – The name of the first directory and first job e.g., 0

  • last_job (int) – The name of the last directory and last job e.g., 39901

  • step (int) – The step size between each single point.

qa.process.combine_qm_replicates() None

Combine the all_charges.xls files for replicates into a master charge file.

The combined file contains a single header with atom numbers as columns. Each row represents a new charge instance. The first column indicates which replicate the charge came from.

qa.process.string_to_list(str_list: List[str]) List[List[int]]

Converts a list of numerical strings to a list of lists of numbers.

It takes a list of numerical strings so that it can process them in bulk.

Examples

["1-4,6,8-10", "1-3"] -> [[1, 2, 3, 4, 6, 8, 9, 10], [1, 2, 3]]
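
A minimal sketch of this conversion is shown below, assuming comma-separated entries that are either single non-negative integers or inclusive ranges written as start-end. The name `expand_ranges` is hypothetical and the code is not taken from the module.

```python
from typing import List


def expand_ranges(str_list: List[str]) -> List[List[int]]:
    """Expand strings such as '1-4,6' into lists such as [1, 2, 3, 4, 6]."""
    result: List[List[int]] = []
    for text in str_list:
        numbers: List[int] = []
        for part in text.split(","):
            if "-" in part:
                # A range such as "8-10" includes both end points.
                start, end = (int(p) for p in part.split("-"))
                numbers.extend(range(start, end + 1))
            else:
                numbers.append(int(part))
        result.append(numbers)
    return result
```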

qa.process.simple_xyz_combine()

Takes all xyz molecular structure files in the current directory and combines them to create a single xyz trajectory.

Notes

The output xyz trajectory file will have no additional white space, and the individual xyz files will be concatenated one after another. The output file will be called combined.xyz.
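
The behavior described in these notes could be sketched as follows. This is an illustration, not the module's code: `concatenate_xyz` is a hypothetical helper, it takes the directory as an argument instead of always using the current one, and the alphabetical file order is an assumption.

```python
from pathlib import Path


def concatenate_xyz(directory: str = ".", output_name: str = "combined.xyz") -> int:
    """Concatenate every .xyz file in directory; return the number of inputs."""
    folder = Path(directory)
    # Exclude a previous output so it is not appended to itself.
    sources = sorted(p for p in folder.glob("*.xyz") if p.name != output_name)
    with open(folder / output_name, "w") as out:
        for source in sources:
            # Strip trailing white space so frames follow each other directly.
            text = source.read_text().rstrip()
            if text:
                out.write(text + "\n")
    return len(sources)
```

Only trailing white space is stripped, because the second line of an xyz frame is a title line that may legitimately be blank.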