Notes to QPweb

These notes are a mixture of a user's guide, release notes, version history, known issues and advice on how to overcome them.

Comments and bug reports are welcome at   reiczigel dot jeno at univet dot hu.

(updated on 18 January 2017)


Tips


Data preparation

QPweb assumes that you enter your data locally on your computer with a spreadsheet program (like Excel in MS Office, Numbers on a Mac, Calc in LibreOffice), or with a text editor (like Notepad in Windows or TextEdit on a Mac), and that you upload your data file to QPweb for analysis once it is ready. This means that QPweb has no counterpart of the data entry screen of QP.

What kind of data?

Data prepared for analysis by QPweb should be a so-called data matrix, in which each row represents a host and each column corresponds to a characteristic, for example host body weight, host age, number of parasite species, number of parasites of a certain species, and so on. These characteristics are also called variables, traits, characters, features, or attributes. We'll use the word variable.

It is quite usual that the first row contains the names of the variables rather than the data of a host. As an example, look at the following simple data matrix:

Location  HostBodyWeight  HostSex  Paras1Larvae  Paras1Adults  Paras2Larvae  Paras2Adults
Q14       12.5            Male     0             0             7             12
Q14       11.0            Male     0             0             0             2
Q18       9.3             Female   10            15            16            19
Q14       10.8            Female   0             0             17            10
Q18       16.0            Male     25            20            5             19

QPweb accepts numerical data only. This is inherited from QP, and we don't plan to change this at the moment. Thus the columns "Location" and "HostSex" above will not be accepted by QPweb as they are. You have to use number codes for location and sex (say, 1 for female and 2 for male, and 10014 and 10018 for the locations).
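
The recoding can be done by hand in the spreadsheet, or with a few lines of script before uploading. The following is a minimal Python sketch of the idea, not part of QPweb. The code tables LOCATION_CODES and SEX_CODES and the function recode are hypothetical names chosen for illustration; any numbers will do as long as they are used consistently.

```python
# Hypothetical code tables: the numbers themselves are arbitrary,
# they only have to be used consistently within a data set.
LOCATION_CODES = {"Q14": 10014, "Q18": 10018}
SEX_CODES = {"Female": 1, "Male": 2}

# Two rows of the example data matrix, still with text values.
rows = [
    ["Q14", "12.5", "Male", "0", "0", "7", "12"],
    ["Q18", "9.3", "Female", "10", "15", "16", "19"],
]


def recode(row):
    """Replace the text in column 0 (Location) and column 2 (HostSex) by number codes."""
    coded = list(row)
    coded[0] = str(LOCATION_CODES[row[0]])
    coded[2] = str(SEX_CODES[row[2]])
    return coded


coded_rows = [recode(row) for row in rows]
for row in coded_rows:
    print("; ".join(row))
```

An unknown location or sex value raises a KeyError here, which is a convenient way to spot typing errors in the text columns.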

QP accepted only whole numbers. Although this wasn't a real restriction, as one could always choose a measurement unit in which data were whole numbers, QPweb is prepared to accept decimal numbers too. In some analysis methods, however, decimal numbers may lead to errors. If you find a bug like that, please tell us about it.

The maximum size of a file to upload and analyse is 200 kB. If you want to work with larger files, please contact us by email.

Uploaded files can be displayed by clicking on the file name, or deleted by clicking on the "x" beside the file name.

Variable names (names of columns of the data matrix)

If your data file contains variable names in the first row, these names will be read in and used in QPweb. Be careful with special characters like comma, semicolon, or accented characters like ä or ô in names (also in file names), as these may cause errors. For structuring names you can safely use two special characters, dot and underscore, and you can also make use of the fact that names in QPweb are case sensitive. Some naming examples: goose.summer, goose.autumn, lice_male, lice_female, NumberSurvived2011, PctSurvived2011.

If you don't have variable names, that is, if even the first row in your file contains data rather than names, you should set "No" at "Variable names in the first row" in the "Import Data" screen. Then QPweb gives the default names "Var1", "Var2", "Var3", etc.

Entering data using a spreadsheet program (Excel, Numbers, Calc, etc.)

If you enter your data in a spreadsheet program, save it as a simple text file (this usually results in the file extension ".txt") or as a so-called "comma-separated" text file (usually with the extension ".csv"). In a simple text file the values are separated by blanks or tab characters, in a csv file by commas or semicolons. (The semicolon is used in countries where the decimal symbol is the comma.) Spreadsheet programs usually offer several formats for saving your data; you should find out which format works best for reading your data into QPweb. The "Import Data" screen of QPweb lets you choose the appropriate separator character (blank, tab, comma, semicolon, or any other).

If you use decimal numbers, be sure to specify the right decimal symbol in the "Import Data" screen of QPweb. If you are uncertain which is used in your data file, check it by looking into your data file using a text editor.

Entering data using a text editor (Notepad, TextEdit, gedit, etc.)

If you enter your data using a text editor, using a comma or semicolon as the delimiter between the numbers is safer than using a space or tab. Note that two delimiters with no number between them (for example , , or ; ;) are interpreted as an empty field, that is, one with missing data. Keep this in mind if you get the error message "Too many fields!" This is most likely to occur when the delimiter is a space or tab, because these characters are invisible and it is not easy to notice when two of them are next to each other.
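
A doubled delimiter can be found before uploading by counting the fields in each line. The sketch below is only an illustration in Python; the function name check_field_counts is hypothetical, and it assumes that every delimiter starts a new field, as described above. It does not reproduce QPweb's own import code.

```python
def check_field_counts(lines, delimiter=";"):
    """Return (line number, field count) for each line that has a different
    number of fields than the first line."""
    expected = len(lines[0].split(delimiter))
    problems = []
    for number, line in enumerate(lines, start=1):
        n_fields = len(line.split(delimiter))
        if n_fields != expected:
            problems.append((number, n_fields))
    return problems


lines = [
    "Var1 Var2 Var3",
    "1 5 10.5",
    "1  5 10.5",  # two spaces between 1 and 5 create an extra, empty field
]
problems = check_field_counts(lines, delimiter=" ")
print(problems)
```

Here the third line is reported with four fields instead of three, which is the kind of line that is likely to trigger "Too many fields!".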

The above data (location and sex coded by numbers) entered in a text editor looks like this:

    Location; HostBodyWeight; HostSex; Paras1Larvae; Paras1Adults; Paras2Larvae; Paras2Adults
    10014; 12.5; 2; 0; 0; 7; 12
    10014; 11.0; 2; 0; 0; 0; 2
    10018; 9.3; 1; 10; 15; 16; 19
    10014; 10.8; 1; 0; 0; 17; 10
    10018; 16.0; 2; 25; 20; 5; 19

To import this file use the following settings on the "Import Data" screen:
    - text file
    - field separator: semicolon
    - variables in the first row: yes
    - decimal point character: period
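
If you want to check outside QPweb that a file reads as expected with these settings, the following Python sketch mimics them with the standard csv module: semicolon as field separator, names in the first row, and period as decimal point. It is only an illustration and not the way QPweb itself reads files; the variable names text, names and data are chosen for this example.

```python
import csv
import io

# First lines of the example file, as they would be typed in a text editor.
text = """Location; HostBodyWeight; HostSex; Paras1Larvae; Paras1Adults; Paras2Larvae; Paras2Adults
10014; 12.5; 2; 0; 0; 7; 12
10018; 9.3; 1; 10; 15; 16; 19
"""

# Field separator: semicolon. Blanks right after a separator are skipped.
reader = csv.reader(io.StringIO(text), delimiter=";", skipinitialspace=True)

# Variable names in the first row: yes.
names = next(reader)

# Decimal point character: period (this is what float() expects).
data = [[float(value) for value in row] for row in reader]

print(names)
print(data)
```

For a real file, replace io.StringIO(text) with open("yourfile.txt", newline="").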

When importing your data file, you can assign a short name to the data set. If you don't give a name, the file name without its extension (.txt, .csv, .doc, etc.), truncated to 18 characters, is used as the data set name. Giving a short name is useful when
    - your data file has too long a name,
    - the file name contains special characters, or
    - after truncation two files would have the same name.
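
The default naming rule can be sketched in Python as follows. This is an assumption-based illustration of the rule as described above (extension removed, then the first 18 characters kept), not QPweb's actual code; the function name default_dataset_name and the file names are hypothetical.

```python
import os


def default_dataset_name(file_name, max_len=18):
    """Drop the extension, then keep the first max_len characters."""
    stem, _extension = os.path.splitext(os.path.basename(file_name))
    return stem[:max_len]


name_a = default_dataset_name("FoxesSummer2010_site_A.csv")
name_b = default_dataset_name("FoxesSummer2010_site_B.csv")
print(name_a, name_b)
```

Both files end up with the same default name, because they differ only after the 18th character. This is the third case in the list above, where giving a short name by hand avoids confusion.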

Using data which you entered in QP 2.0 or 3.0

Previous versions of QP stored the entered data in files with the extension ".dat" in the same folder in which the program was located. Such ".dat" files can be imported by selecting "QP3.0 data file" in the "Import Data" screen. This doesn't require further parameter settings, as QPweb knows the format of the QP 3.0 files, and reads in the data correctly.

QP 3.0 data files consist of one single column representing the number of parasites per host. If you read in a QP 3.0 data file, the variable name for this single column will always be "Data".

Missing data

Missing data were no problem in QP, as you simply didn't enter them. But in QPweb a data file can have more than one column (that is, several traits for each host), so it may be that for a particular host some data are present and others are missing. If a particular data item is missing, you can write NA in place of it, or you can simply write two delimiters next to each other, with nothing between them. The following two lines result in the same data uploaded:
    1, 5, NA, 10.5, 0.5
    1, 5, , 10.5, 0.5
Although NA is not a number, it will be read and understood by QPweb correctly, as this is the standard code for missing values in the program R.
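
To illustrate why the two lines above are equivalent, here is a minimal Python sketch of a reader that treats both NA and an empty field as missing. It is not QPweb's import code; the function names parse_field and parse_line are hypothetical, and None is used here to stand for a missing value.

```python
def parse_field(field):
    """Convert one field to a number; NA or an empty field means missing (None)."""
    text = field.strip()
    if text == "" or text == "NA":
        return None
    return float(text)


def parse_line(line, delimiter=","):
    """Split a data line at the delimiter and convert every field."""
    return [parse_field(field) for field in line.split(delimiter)]


with_na = parse_line("1, 5, NA, 10.5, 0.5")
with_gap = parse_line("1, 5, , 10.5, 0.5")
print(with_na)
print(with_gap)
```

Both calls give the same five values, with the third one missing.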

If you are using a space as the delimiter, QPweb interprets two spaces next to each other as a missing data item. Using a space as the delimiter is therefore somewhat risky; it is better to use a comma or semicolon.

NA, NaN, Inf in the output

Due to missing values in the data, calculations may also result in missing values. If for example all values of a certain variable are missing, their mean or median will also be missing. This is indicated on the output by NA.

Division by zero may also result in a missing value. In statistics this may occur when a statistic is divided by its standard error, and the standard error happens to be zero. A typical example is a t-test on constant data (that is, all data values are equal). In R, when a positive number is divided by 0, the result is Inf (infinity). If a negative number is divided by 0, the result is -Inf (minus infinity). If zero is divided by zero (0/0), the result is NaN ("not a number").

In R, if all values of a data series are missing, their minimum results in Inf, and their maximum results in -Inf.

Results and graphs

Results accumulate in a text window, and can be copied from it into a word processor or another program in the standard way of select-copy-paste.

In reporting the results, we follow the $ notation used in the R program, that is, if you have read in a data set named "FoxesSummer2010", and a variable in this data set has the name "TicksAdults", then this variable appears in the results as "FoxesSummer2010$TicksAdults".

If such a name is too long to be displayed in the output (>30 characters), its beginning is displayed, followed by three dots. To receive unambiguous reports, give short dataset names and variable names. If your file has a long name, give it a short name for use in QPweb when importing it. Note that the name "FoxesSummer2010$TicksFemales" is still o.k. (28 characters).

In QPweb we intend to follow the convention that p-values from statistical tests should be reported numerically with 4 decimals (rather than just writing p < 0.05), except if the p-value is smaller than 0.0001. In such cases the conventional form of reporting is p < 0.0001. If QPweb still reports p=0.000002, p=0 or the like, you should change it in your publication to p < 0.0001.
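
The reporting convention can be written as a small rule. The Python sketch below only illustrates the convention described above; the function name format_p is hypothetical and is not part of QPweb.

```python
def format_p(p):
    """Report a p-value with 4 decimals, or as 'p < 0.0001' if it is smaller."""
    if p < 0.0001:
        return "p < 0.0001"
    return f"p = {p:.4f}"


for p in (0.03217, 0.000002, 0):
    print(format_p(p))
```

The first value is reported as p = 0.0322, the other two as p < 0.0001.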

Sometimes very large or very small numbers are written in the so-called scientific form with a symbol "e" for exponent. For example, 1.28e-12 means 1.28 × 10^-12.
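
Most programs read this notation directly. As a small illustration (in Python, which is not what QPweb runs on), the text "1.28e-12" converts to the same number as 1.28 multiplied by 10 to the power of -12, up to rounding:

```python
import math

value = float("1.28e-12")            # read the scientific form
by_hand = 1.28 * 10 ** -12           # the same number written out
same = math.isclose(value, by_hand)  # compare up to rounding error

print(same)
print(f"{value:.2e}")                # write it in scientific form again
```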

QPweb stores only the last 10 diagrams, that is, when the 11th diagram is created, the first one is deleted, and so on. Save the diagrams you want to keep before they are deleted. Diagrams are png files, and can be saved or copied from the graph window in the standard way.

Special notes to Mac users

We have almost no experience with Apple Macintosh computers; in particular, we don't know which programs are most popular for data entry on a Mac. We would appreciate it if Mac users sent us their experience of what works well and what does not. We would include their tips in these notes.

If you enter data in TextEdit, save your data in simple text format (.txt). If TextEdit does not offer this format, you can set it in the "Format" menu before saving.

If you enter data in Excel or Numbers, save your data as "comma separated values" (.csv), as "Windows text" (.txt), or as "MS-Dos text" (.txt). In case of "comma separated values" please check if the delimiter is really a comma (it depends on the local settings, e.g. in Germany it is semicolon, because the comma is reserved for the decimal character).

If you enter data in Word or Pages, save your data as "text only" (.txt).


Version history


1.0.13     2017-01-18
A new module for the estimation of parasite species richness is included. It aims to estimate the number of parasite species infecting a host, including those unobserved in the actual sample.

The "Chao2" method is used for the estimation, as this was found to perform best for parasite infection data by Walter and Morand (1998), Parasitology, 116, 395-405. This method estimates the number of unobserved parasite species from the number of rare species (occurring only in 1 or 2 hosts in the sample). If there are no such rare species in the sample, the estimation fails. The method performs well if the number of rare species is less than 50% of all parasite species in the data set. A large sample of hosts is also advisable for a reliable estimate of species richness (a few hundred hosts are recommended).

The procedure uses an incidence (or abundance) matrix where each row represents a host, and each column represents a parasite species (as usual in QPweb). Be careful to select for analysis only those variables (=columns) of the data set that represent infection by some parasite. Values greater than 0 correspond to "present" while zeroes mean "absent" (that is, abundance data are automatically converted to incidence data by the program).

For a detailed description of the method and for correct interpretation of the results see

Chao, A. and Chiu, C.-H. 2016. Species Richness: Estimation and Comparison. Wiley StatsRef: Statistics Reference Online. 1-26.

or the author's original version at

http://chao.stat.nthu.edu.tw/wordpress/paper/114.pdf
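
To make the idea concrete, here is a minimal Python sketch of a Chao2-type estimate. It assumes the classic form of the estimator, S_obs + ((m-1)/m) * Q1^2 / (2 * Q2), where m is the number of hosts, S_obs the number of observed species, and Q1 and Q2 the numbers of species found in exactly 1 and exactly 2 hosts. Which variant QPweb computes is not stated in these notes, so treat this only as an illustration; the function name chao2 and the data are hypothetical.

```python
def chao2(matrix):
    """Chao2-type richness estimate from a host-by-species matrix.

    matrix: list of rows (one per host), each a list of counts (one per species).
    Assumes the classic form S_obs + ((m - 1) / m) * Q1**2 / (2 * Q2).
    """
    m = len(matrix)                # number of hosts
    n_species = len(matrix[0])     # number of columns
    # Convert abundance to incidence: in how many hosts is each species present?
    incidence = [sum(1 for row in matrix if row[j] > 0) for j in range(n_species)]
    s_obs = sum(1 for k in incidence if k > 0)   # species seen at least once
    q1 = sum(1 for k in incidence if k == 1)     # species found in exactly 1 host
    q2 = sum(1 for k in incidence if k == 2)     # species found in exactly 2 hosts
    if q2 == 0:
        raise ValueError("no species occurs in exactly 2 hosts; this form is undefined")
    return s_obs + (m - 1) / m * q1 ** 2 / (2 * q2)


# Hypothetical abundance data: 5 hosts (rows), 4 parasite species (columns).
hosts = [
    [3, 0, 0, 1],
    [0, 0, 2, 5],
    [1, 0, 1, 0],
    [0, 4, 0, 2],
    [2, 0, 0, 0],
]
estimate = chao2(hosts)
print(round(estimate, 2))
```

In this toy data set 4 species are observed, one of them in a single host and one in two hosts, so the estimate is slightly above 4. With only 5 hosts such an estimate is of course not reliable; the data are for illustration only.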


1.0.12     2016-04-22
The three new "Group comparisons" modules are improved. The modules aborted without any error message when, due to missing values or choosing a wrong grouping variable, there were no groups to compare to each other. Now the program gives error messages in such cases.
The "Group comparisons" modules require that the data set contains a variable with at least 2 different values, for example sex of host, age group of host, location of observation, season of year, etc. This variable should be selected as the one defining the groups. The groups are then compared with respect to prevalence of a parasite or mean of another variable. A maximum of 6 groups is allowed in these procedures (if the selected grouping variable has more than 6 different values, an error message is issued).
Be aware that there are two "Group comparisons" modules for comparing means. One is made for comparing mean intensities, that is, it uses only the nonzero values to compute the means, and the other for abundance, where zeroes are also included. For host traits other than infection, you should consider carefully which one you need. (In most cases zeroes should also be included, but this is not always the case.)
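
The difference between the two kinds of mean is easy to see on the example data. The Python sketch below is only an illustration of the definitions (it does not perform the group comparison tests); the function names are chosen for this example. It uses the Paras2Larvae values of the example table, grouped by HostSex.

```python
def mean_abundance(counts):
    """Mean over all hosts, zeroes included."""
    return sum(counts) / len(counts)


def mean_intensity(counts):
    """Mean over infected hosts only (nonzero values); None if no host is infected."""
    infected = [c for c in counts if c > 0]
    if not infected:
        return None
    return sum(infected) / len(infected)


def prevalence(counts):
    """Proportion of infected hosts."""
    return sum(1 for c in counts if c > 0) / len(counts)


# Paras2Larvae from the example table, grouped by HostSex (1 = female, 2 = male).
groups = {1: [16, 17], 2: [7, 0, 5]}

for sex, counts in groups.items():
    print(sex, prevalence(counts), mean_abundance(counts), mean_intensity(counts))
```

For the males the mean abundance is 4.0 but the mean intensity is 6.0, because the uninfected host is left out. For the females both means are 16.5, as all of them are infected.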

1.0.11     2016-02-26
Aggregation indices: a 95% bootstrap BCa CI is computed for Poulin's discrepancy index.
In several procedures only the conventional confidence levels (90, 95, 99%) can be used.
New procedure: two-sample comparison of Poulin's discrepancy index.
New procedure: two-sample comparison of intensity distributions using Neuhäuser's location-scale test.
    This test is sensitive to any difference between the distributions (means, medians, variances, etc). However, the test doesn't tell which feature of the data is responsible for the detected difference. That should be explored by inspection of the descriptive statistics and graphs.
    For details of the test see Neuhäuser, M. (2000) An exact two-sample test based on the Baumgartner-Weiss-Schindler statistic and a modification of Lepage's test, Commun. Statist. Theory and Methods, 29, 67-78.
New procedures: group comparisons of prevalences and mean intensities.
    Data entered in Excel usually contain columns that define some grouping, for example location, time, sex of the host, etc., and you may want to compare prevalence or intensity between these subgroups. For illustration, let us have a look at the example table again.

Location  HostBodyWeight  HostSex  Paras1Larvae  Paras1Adults  Paras2Larvae  Paras2Adults
10014     12.5            2        0             0             7             12
10014     11.0            2        0             0             0             2
10018     9.3             1        10            15            16            19
10014     10.8            1        0             0             17            10
10018     16.0            2        25            20            5             19

    You may want to compare the prevalence of adults of the first parasite species between the two locations. Or you may want to compare the mean intensity of parasite 2 larvae between male and female hosts. Until now this was possible only if you split the data by location or by host sex, and read in each part of the data set separately. Now groups defined by a variable in the data set can be compared directly.
    If you select a group comparison procedure, you should specify which of the selected variables is the grouping variable. (Similarly to the Scatterplot, where you have to specify which variable should go to the y axis of the graph.) Note that the maximum number of groups to compare is 6.
    Comparison of means is made by bootstrap t-test (for 2 groups) or by bootstrap ANOVA (for more than 2 groups). Comparison of prevalences is made by Fisher's exact test.
    Don't forget that QPweb cannot read in text variables, so use number codes for the grouping variables.

1.0.10     2015-08-16
Enhanced scatterplot: the user can specify which variable to put on the X and Y axis.
New options: now the user can set the confidence level of confidence intervals and in the bootstrap procedures the number of bootstrap replications.
Don't change the conventional 95% confidence level unless you have good reasons to do so. For example, if you receive an error message that a 95% bootstrap CI cannot be constructed even with 10000 replications, change it to 90%. Or if the 95% CI is disappointingly wide, you can try the 90% CI. If you're lucky, it'll look better. Or if the 95% CI consists of a single value, this may encourage you to try the 99% CI. Other values aren't accepted by the procedures.

1.0.9     2014-12-23
A bug related to bootstrap confidence intervals has been fixed. Since the R function for the BCa interval fails when the sample size is greater than the number of bootstrap replications, in such cases we apply the percentile method instead of BCa. Affected procedures are bootstrap confidence intervals for mean intensity, mean abundance, and mean crowding.
This "Notes to QPweb" has become more detailed.
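
For readers unfamiliar with the percentile method mentioned in this entry, here is a minimal Python sketch of a percentile bootstrap confidence interval for a mean. It is not QPweb's code (QPweb relies on R); the function name percentile_ci_mean, the default of 2000 replications, the fixed seed and the intensity values are all assumptions made for the illustration.

```python
import random


def percentile_ci_mean(values, level=0.95, replications=2000, seed=1):
    """Percentile bootstrap CI for the mean of values.

    Resample the data with replacement, compute the mean of each resample,
    and take the lower and upper percentiles of these bootstrap means.
    """
    rng = random.Random(seed)      # fixed seed so that the result is repeatable
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(replications)
    )
    alpha = 1 - level
    lo_index = round(alpha / 2 * (replications - 1))
    hi_index = round((1 - alpha / 2) * (replications - 1))
    return means[lo_index], means[hi_index]


# Hypothetical intensities (parasite counts of the infected hosts only).
intensities = [15, 20, 3, 8, 1, 12]
lower, upper = percentile_ci_mean(intensities)
print(lower, upper)
```

The BCa interval additionally corrects these percentiles for bias and skewness, which is why it is preferred whenever it can be computed.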

1.0.8     2014-02-28
Attempts to upload files that were too large caused weird errors with no warning, therefore we limited the file size, and now issue an error message if the maximum allowed size (now 200 kB) is exceeded. If you want to work with larger files, please contact us by email.

1.0.7     2014-01-24
"Show graphs" is redesigned, so a new diagram doesn't overwrite the previous one. Diagrams are numbered consecutively, and always the last 10 are available for display.
A few minor bugs are fixed.

1.0.6     2013-12-20
Import data: Apple Macintosh line endings are handled correctly now. (Until now only data saved in "Windows text" format were imported correctly from a Mac.)
Comparison of mean crowding: When the variability of infection is too small for the bootstrap test (for example, if one of the samples doesn't contain any infected host, or it contains just one, or the like), an error message is issued.
Stochastic equality of intensity distributions: When the variability of infection is too small for the bootstrap test (for example, if one of the samples doesn't contain any infected host, or it contains just one), an error message is issued.

1.0.5     2013-08-31
Confidence interval for mean intensity: if there is only one infected host in the sample, an error message is issued because a CI cannot be calculated from one single intensity value.
Confidence interval for mean crowding: if there is only one infected host in the sample, an error message is issued because a CI cannot be calculated from one single crowding value.
Scatterplot with Spearman's rank correlation: output in the results file is completed with the names of variables.
Aggregation indices: if the negative binomial exponent 'k' cannot be calculated, a short explanation is given about the possible reasons why.

1.0.4     2013-08-01
Importing data was modified so that empty lines at the end of the data file are removed. (Note that empty lines between non-empty data lines remain illegal and generate an error.)
Fisher's exact test for comparing prevalences and Mood's median test for comparing median intensities now provide simulated p-values (based on 20000 Monte Carlo replications) when the samples are too large for computing an exact p-value.

1.0.3     2013-07-07
A bug related to importing data files without column names and with blank delimiter is fixed.
Some tooltips are added.
Handling missing data is improved in the following procedures:
- Confidence interval for prevalence (Clopper-Pearson CI),
- Confidence interval for prevalence (Blaker's method, shorter CI),
- Confidence interval for prevalence (Sterne's method, shorter CI),
- Confidence interval for mean intensity (Bootstrap BCa),
- Confidence interval for mean abundance (Bootstrap BCa),
- Confidence interval for mean crowding (Bootstrap BCa),
- Comparison of prevalences (Chi-square test),
- Comparison of mean intensities (Bootstrap t-test),
- Comparison of mean abundances (Bootstrap t-test).

1.0.2     2013-06-21
This "Notes to QPweb" is provided.
Import data: a bug related to the use of decimal comma is fixed. Data files with decimal comma can be imported correctly now.
Aggregation indices: when the negative binomial exponent k cannot be estimated (e.g. because the data are not at all aggregated or deviate extremely from the negative binomial), NA is reported. (Previously either nothing or a huge number was reported.)
Comparison of mean abundances: this analysis module was accidentally deleted, so no results were produced when choosing this analysis. Now it is restored.


Known issues


In some cases the analysis aborts without any sensible error message. Some examples of this:
- If all values of a variable are missing (for example due to an erroneous uploading of data)
- If all values of a variable are the same, that is, if the variance of the variable is zero
- If mean intensity is to be calculated and there is no infected host in the sample
We'll fix these by checking the conditions and issuing appropriate error messages.