This chapter introduces some of Perl's internal data structures, and the information presented here serves as a reference for the rest of the book. This chapter will be useful not only for programmers who want to add their own extensions to Perl but also for those who simply want to look at the Perl source code to see what's "under the hood."
Perl is written primarily in C and provides libraries that you can link with your own C/C++ code. To perform this linking, however, your programs have to know how Perl stores its own data structures as well as how to interpret Perl's data types.
The information presented in this chapter also will show you where to look for files, structures, and so on. There will be times when you are writing extensions or modules that you'll need to look up specific data structure definitions. The functions defined here are called from your extension's C sources.
Version 5.002b was the latest Perl release at the time this book was written. The b stands for beta; therefore, some changes to the source tree are quite possible. What you see here will not only be a snapshot in time of the source tree for 5.002b, but will also serve as a basis for you to do your own research.
The information in this chapter is about the functions you can call from C functions that interact with Perl variables. Your C code could be calling the Perl functions, or your Perl function could be calling your C code as part of an extension. The C functions have to be linked with the Perl libraries and also require the header files in your Perl distribution.
The compiler most likely to work with Perl on almost all platforms is the GNU C compiler. If you have problems compiling with other commercial compilers, get the GNU compiler from the Net. A good place to try is the oak.oakland.edu ftp site.
This section covers some of the header files in your Perl distribution. Table 25.1 briefly describes what each of the files covered here contains. You can track down values or specific definitions by starting from these header files.
File | Purpose |
XSUB.h | Defines the XSUB interface (see Chapter 27, "Writing Extensions in C," for more information) |
av.h | Array variable information |
config.h | Generated when Perl is installed |
cop.h | Control operation (COP) and context block structures |
cv.h | Code value (subroutine) structure |
dosish.h | Redefining stat, fstat, and fflush for DOS |
embed.h | For embedding Perl in C |
EXTERN.h | Global and external variables |
form.h | Field codes used by Perl formats |
gv.h | Glob pointer definitions |
handy.h | Handy macros and basic type definitions |
hv.h | Hash definitions |
INTERN.h | Perl internal variables |
keywords.h | For Perl keywords |
mg.h | Definitions for using MAGIC structures |
op.h | Perl operators |
patchlevel.h | For current patchlevel information |
pp.h | Push/pop macros for the argument stack |
perl.h | Main header for Perl |
perly.h | For the yylex parser |
proto.h | Function prototypes |
regexp.h | Regular expressions |
scope.h | Scoping rule definitions |
sv.h | Scalar values |
util.h | Blank header file |
unixish.h | For UNIX-specific definitions |
The source files in the Perl distribution are as follows. They come with very sparse comments.
av.c | mg.c | pp_sys.c |
deb.c | miniperlmain.c | regcomp.c |
doio.c | op.c | regexec.c |
doop.c | perl.c | run.c |
dump.c | perlmain.c | scope.c |
globals.c | perly.c | sv.c |
gv.c | pp.c | taint.c |
hv.c | pp_ctl.c | toke.c |
malloc.c | pp_hot.c | util.c |
The name of each file gives a hint as to what the code in the file does. Run head *.c > text to get a list of the headers for the files.
Now that you know a little about what source files to consult, you're ready to learn about the building blocks of Perl programs: the variables.
Perl has three basic data types: scalars, arrays, and hashes. Perl enables you to have references to these data types as well as references to subroutines. Most references use scalar values to store their values, but you can have arrays of arrays, arrays of references, and so on. It's quite possible to build complicated data structures using the three basic types in Perl.
Variables in Perl programs can even have two types of values, depending on how they are interpreted. For instance, $i can be an integer when used in a numeric operation, and $i is a string when used in a string operation. Another example is $!, which is the errno code when used as a number but a string when used within a print statement.
Because variables internal to the Perl source code can have many types of values and definitions, each name must be descriptive enough to indicate its type. By convention, Perl source code variable names carry one of three type tokens: one each for arrays, hashes, and scalar variables. A scalar variable can be further qualified to define the type of value it holds. The token prefixes for these Perl types are shown in the following list:
AV | Array variables |
HV | Hash variables |
SV | Generic scalar variables |
I32 | 32-bit integer (scalar) |
I16 | 16-bit integer (scalar) |
IV | Integer or pointer only (scalar) |
NV | Double only (scalar) |
PV | String pointer only (scalar) |
If you see SV in a function or variable name, the function is probably working on a scalar item. The convention is followed closely in the Perl source code, and you should be able to glean the type of most variable names as you scan through the code. Function names in the source code can begin with sv_ for scalar variables and related operations, av_ for array variables, and hv_ for hashes.
Let's now cover these variable types and the functions to manipulate them.
Scalar variables in the Perl source are those with SV in their names. A scalar variable on a given system is the size of a pointer or an integer, whichever is larger. Specific types of scalars exist to specify numbers such as IV for integer or pointer, NV for doubles, and so on. The SV definition is really a typedef declaration of the sv structure in the header file called sv.h. NV, IV, PV, I32, and I16 are type-specific definitions of SV for doubles, generic pointers, strings, and 32- and 16-bit numbers.
Floating-point numbers in Perl are stored as doubles (NV), and integers are stored as IVs. Thus, a variable with NV will be a double that you can cast in a C program to whatever type you want.
Four types of routines exist to create an SV variable. All four return a pointer to a newly created variable. You call these routines from within an XS Perl extension file:
SV* newSViv(IV);
SV* newSVnv(double);
SV* newSVpv(char*, int);
SV* newSVsv(SV*);
The way to read these function declarations is as follows. Take the newSViv(IV) declaration, for example. The new portion of the declaration asks Perl to create a new object. The SV indicates a scalar variable. The iv indicates a specific type to create: iv for integer, nv for double, pv for a string of a specified length, and sv for all other types of scalars.
Three functions exist to get the value stored in an SV. The type of value returned depends on what type of value was set at the time of creation:
int SvIV(SV*); | This function returns an integer value of the SV being pointed to. Cast the return value to a pointer if that is how you intend to use it. A sister macro, SvIVX(SV*), does the same thing as the SvIV function. |
double SvNV(SV*); | This function returns a floating-point number. The SvNVX macro does the same thing as the SvNV() function. |
char *SvPV(SV*, STRLEN len); | This macro returns a pointer to a char. Although len looks as if it is passed by value, SvPV is a macro that assigns to the len variable you name, so it behaves as if you had passed a pointer to len. The macro uses len to return the length of the string in SV. The SvPVX macro returns the string too, but you do not have to specify the STRLEN len argument. |
You can modify the value contained in an already existing SV by using the following functions:
void sv_setiv(SV* ptr, IV incoming); | This function sets the value of the SV being pointed to by ptr to the integer value in incoming. |
void sv_setnv(SV* ptr, double incoming); | This function sets the value of the SV being pointed to by ptr to the double value in incoming. |
void sv_setsv(SV* dst, SV* src); | This function sets the value of the SV being pointed to by dst to the value of the SV pointed to by src. |
Perl does not keep NULL-terminated strings like C does. In fact, Perl strings can have multiple NULLs in them. Perl tracks strings by a pointer and the length of the string. Strings can be modified in one of these ways:
void sv_setpvn(SV* ptr, char* anyt , int len);
This sets the value of the SV being pointed to by ptr to the value in anyt. The string anyt contains an array of char items and does not have to be a NULL-terminated string. In fact, the string anyt can contain NULLs because the function uses the value in len, rather than a terminator, to decide how many characters to copy.
void sv_setpv(SV* ptr, char* nullt);
This sets the value of the SV being pointed to by ptr to the value in nullt. The nullt string is a NULL-terminated string like those in C, and the function calculates and sets the length for you automatically.
SvGROW(SV* ptr, STRLEN newlen);
This function increases the size of a string to the size in newlen. You cannot decrease the size of a string using this function. Make a new variable and copy into it. You can use the function SvCUR(SV*) to get the length of a string and SvCUR_set(SV*, I32 length) to set the length of a string.
void sv_catpv(SV* ptr, char*);
This function appends a NULL-terminated string to a string in SV.
void sv_catpvn(SV* ptr, char* str, int len);
This function appends len characters from str to the SV pointed at by ptr.
void sv_catsv(SV*dst, SV*src);
This appends another SV* to an SV.
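To see why tracking a pointer plus a length lets a string carry embedded NULLs, here is a minimal standalone sketch in plain C. It is not taken from the Perl source: the type CountedStr and the functions cs_grow(), cs_setpvn(), and cs_catpvn() are hypothetical names that only imitate the behavior described above for SvGROW(), sv_setpvn(), and sv_catpvn().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a length-counted string; not Perl's real SV. */
typedef struct {
    char  *buf;  /* the bytes; a '\0' may appear anywhere inside */
    size_t cur;  /* number of bytes in use (compare SvCUR) */
    size_t len;  /* number of bytes allocated */
} CountedStr;

/* Make sure at least newlen bytes are allocated; never shrinks. */
void cs_grow(CountedStr *s, size_t newlen)
{
    assert(s != NULL);
    if (newlen > s->len) {
        char *p = realloc(s->buf, newlen);
        assert(p != NULL);
        s->buf = p;
        s->len = newlen;
    }
}

/* Replace the contents with exactly n bytes taken from src. */
void cs_setpvn(CountedStr *s, const char *src, size_t n)
{
    cs_grow(s, n + 1);
    memcpy(s->buf, src, n);
    s->buf[n] = '\0';  /* trailing terminator is only a convenience for C */
    s->cur = n;
}

/* Append exactly n bytes taken from src. */
void cs_catpvn(CountedStr *s, const char *src, size_t n)
{
    cs_grow(s, s->cur + n + 1);
    memcpy(s->buf + s->cur, src, n);
    s->cur += n;
    s->buf[s->cur] = '\0';
}
```

Starting from a zeroed CountedStr, storing the five bytes of "ab\0cd" leaves cur at 5 even though strlen() on the buffer would stop at 2, which is exactly the difference between a counted string and a C string.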
Your C program using these functions will crash if you are not careful enough to check whether a variable actually holds the kind of value you are about to fetch. To check what a scalar variable holds, you can call these macros:
SvPOK(SV*ptr) | For string |
SvIOK(SV*ptr) | For integer |
SvNOK(SV*ptr) | For double |
SvTRUE(SV *ptr) | For Boolean value |
A value of FALSE received from these macros means that the variable does not hold a value of that type. You can only get two returned values, either TRUE or FALSE, from the macros that check whether a variable is a string, integer, or double. The SvTRUE(SV*) macro returns 0 if the value pointed at by SV is an integer zero or if SV does not exist. Two other global SVs, sv_yes and sv_no, hold Perl's own true and false values and can be used where an SV is needed instead of TRUE and FALSE, respectively.
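One way to picture these checks is as flag bits that record which slots of a scalar currently hold a valid value. The following standalone sketch is only a model under that assumption: the Scalar type, the F_IOK and F_POK bits, and the sc_ functions are hypothetical and are not the definitions in sv.h. It also shows how one variable can be both a number and a string, with each form converted on demand.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical flag bits; the real ones in sv.h have other names and values. */
enum { F_IOK = 1, F_POK = 2 };

typedef struct {
    unsigned flags;   /* which slots currently hold a valid value */
    long     iv;      /* integer slot */
    char     pv[32];  /* string slot */
} Scalar;

void sc_setiv(Scalar *s, long v)
{
    assert(s != NULL);
    s->iv = v;
    s->flags = F_IOK;            /* any old string form is now stale */
}

void sc_setpv(Scalar *s, const char *str)
{
    assert(s != NULL && str != NULL);
    snprintf(s->pv, sizeof s->pv, "%s", str);
    s->flags = F_POK;            /* any old integer form is now stale */
}

int sc_iok(const Scalar *s) { return (s->flags & F_IOK) != 0; }
int sc_pok(const Scalar *s) { return (s->flags & F_POK) != 0; }

/* Fetch as an integer, converting from the string slot on demand. */
long sc_iv(Scalar *s)
{
    assert(s != NULL);
    if (!(s->flags & F_IOK)) {
        s->iv = (s->flags & F_POK) ? strtol(s->pv, NULL, 10) : 0;
        s->flags |= F_IOK;       /* remember the converted value */
    }
    return s->iv;
}

/* Fetch as a string, converting from the integer slot on demand. */
const char *sc_pv(Scalar *s)
{
    assert(s != NULL);
    if (!(s->flags & F_POK)) {
        if (s->flags & F_IOK)
            snprintf(s->pv, sizeof s->pv, "%ld", s->iv);
        else
            s->pv[0] = '\0';
        s->flags |= F_POK;
    }
    return s->pv;
}
```

In this model, asking sc_iok() before reading the iv field directly plays the role of calling SvIOK() before trusting the integer slot. Caching the converted value is a design choice of the sketch, not a statement about what the interpreter does in every case.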
Note |
The Perl scalar undef value is stored in an SV instance called sv_undef. The sv_undef value is not (SV *) 0 as you would expect in most versions of C. |
You can get a pointer to an existing scalar by specifying its variable name in the call to the function:
SV* perl_get_sv("myScalar", FALSE);
The FALSE parameter requests the function to return NULL if the variable does not exist. If you specify a TRUE value as the second parameter, a new scalar variable is created for you and assigned the name myScalar in the current name space.
In fact, you can use package names in the variable name. For example, the following call looks up a variable called desk in the VRML package (pass TRUE instead of FALSE to have it created if it does not exist):
SV *desk;
desk = perl_get_sv("VRML::desk", FALSE);
Now let's look at collections of scalars: arrays.
The functions for handling array variables are similar in operation to those for scalar variables. To create an array called myarray, you would use this call:
AV *myarray = (AV* ) newAV();
To get the array by specifying the name, you can also use the following function. This perl_get_av() returns NULL if the variable does not exist:
AV* perl_get_av(char *myarray, bool makeIt);
The makeIt variable can be set to TRUE if you want the array created, and FALSE if you are merely checking for its existence and do not want the array created if it does not exist.
To initialize an array at the time of creation, you can use the av_make() function. Here's the syntax for the av_make() function:
AV *myarray = (AV *)av_make(I32 num, SV **data);
The num parameter is the size of the AV array, and data is a pointer to an array of pointers to scalars to add to this new array called myarray. Do you see how the call uses pointers to SV, rather than SVs? The added level of indirection permits Perl to store any type of SV in an array. So, you can store strings, integers, and doubles all in one array in Perl. The array passed into the av_make() function is copied into a new memory area; therefore, the original data array does not have to persist.
Check the av.c source file in your Perl distribution for more details on the functions and their parameters. Here is a quick list of the functions you would most likely perform on AVs.
void av_push(AV *ptr, SV *item);
Pushes an item to the back of an array.
SV* av_pop(AV *ptr);
Pops an item off the back of an array.
SV* av_shift(AV *ptr);
Removes an item from the front of the array.
void av_unshift(AV *ptr, I32 num);
Inserts num items into the front of the array. The operation in this function only creates space for you. You still have to call the av_store() function (defined below) to assign values to the newly added items.
I32 av_len(AV *ptr);
Returns the highest index in array.
SV** av_fetch(AV *ptr, I32 offset, I32 lval);
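Gets the value in the array at the offset. If lval is a nonzero value, the fetch is treated as part of a store operation, and an empty item is created at the offset if none exists there. The function returns NULL if there is no item to return.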
SV** av_store(AV *ptr, I32 key, SV* item);
Stores the item at the index given by key.
void av_clear(AV *ptr);
Removes all items but does not destroy the array pointed to by ptr.
void av_undef(AV *ptr);
Removes the array and all its items.
void av_extend(AV *ptr, I32 size);
Resizes the array to the maximum of the current size or the passed size.
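The following standalone sketch models the array operations just listed on a growable array of pointers. It is an illustration only, not the code in av.c: PtrArray and the pa_ functions are hypothetical names, and the growth policy (doubling) is an assumption made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of an array of item pointers; not Perl's real AV. */
typedef struct {
    void **items;  /* element pointers, in the spirit of SV** */
    int    fill;   /* highest index in use, -1 when empty (compare av_len) */
    int    max;    /* number of allocated slots */
} PtrArray;

/* Make sure at least size slots are allocated; never shrinks. */
void pa_extend(PtrArray *a, int size)
{
    assert(a != NULL);
    if (size > a->max) {
        void **p = realloc(a->items, (size_t)size * sizeof *p);
        assert(p != NULL);
        a->items = p;
        a->max = size;
    }
}

/* Add an item at the back. */
void pa_push(PtrArray *a, void *item)
{
    if (a->fill + 1 >= a->max)
        pa_extend(a, a->max ? a->max * 2 : 4);
    a->items[++a->fill] = item;
}

/* Remove and return the item at the back, or NULL if empty. */
void *pa_pop(PtrArray *a)
{
    return a->fill < 0 ? NULL : a->items[a->fill--];
}

/* Remove and return the item at the front, or NULL if empty. */
void *pa_shift(PtrArray *a)
{
    void *first;
    if (a->fill < 0)
        return NULL;
    first = a->items[0];
    memmove(a->items, a->items + 1, (size_t)a->fill * sizeof *a->items);
    a->fill--;
    return first;
}

/* Open num empty slots at the front; the caller stores into them later. */
void pa_unshift(PtrArray *a, int num)
{
    int i;
    assert(num > 0);
    pa_extend(a, a->fill + 1 + num);
    memmove(a->items + num, a->items,
            (size_t)(a->fill + 1) * sizeof *a->items);
    for (i = 0; i < num; i++)
        a->items[i] = NULL;
    a->fill += num;
}
```

As with av_unshift(), pa_unshift() only makes room: the new front slots stay empty until something is stored into them.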
Hash variables have HV in their names and are created in a manner similar to array variables. To create an HV type, you call this function:
HV* newHV();
Here's how to get an existing hash by referring to it by name:
HV* perl_get_hv("myHash", FALSE);
The function returns NULL if the variable does not exist. If the hash does not already exist and you want Perl to create the variable for you, use:
HV* perl_get_hv("myHash", TRUE);
As with the AV type, you can perform the following functions on an HV type of variable:
SV** hv_store(HV* hptr, char* key, U32 klen, SV* val, U32 hash);
SV** hv_fetch(HV* hptr, char* key, U32 klen, I32 lval);
Check the file hv.c in your Perl distribution for details about how these functions are defined. Both of the previous functions return pointers to pointers. The return value from either function will be NULL if the operation fails.
The following functions are defined in the source file:
bool hv_exists(HV*, char* key, U32 klen);
This function returns TRUE if an item with the specified key exists in the hash and FALSE otherwise.
SV* hv_delete(HV*, char* key, U32 klen, I32 flags);
This function deletes the item, if it exists, at the specified key.
void hv_clear(HV*);
This function leaves the hash but removes all its items.
void hv_undef(HV*);
This function removes the hash and its items.
You can iterate through the hash table by using pointers to hash table entries, which have the HE pointer type. To iterate through the hash (such as with the each command in Perl), you can use hv_iterinit(HV*) to set the starting point and then get the next item as an HE pointer from a call to the hv_iternext(HV*) function. To get the item being traversed, make a call to this function:
SV* hv_iterval(HV* hashptr, HE* entry);
The next SV is available via a call to this function:
SV* hv_iternextsv(HV*hptr, char** key, I32* retlen);
The key and retlen arguments are return values for the key and its length. See line 600 in the hv.c.
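The init-then-next iteration pattern can be sketched with a small chained hash table in plain C. This is a model for illustration only and not the code in hv.c: the Hash and Entry types, the fixed bucket count, the hash function, and the h_ function names are all assumptions made for the example. Keys are stored with an explicit length, so they may contain embedded NULLs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 8  /* fixed size keeps the sketch short */

typedef struct entry {
    struct entry *next;  /* next entry in the same bucket */
    char  *key;          /* key bytes, not necessarily terminated */
    size_t klen;         /* key length */
    void  *val;          /* stored item */
} Entry;

typedef struct {
    Entry *buckets[NBUCKETS];
    int    riter;        /* bucket the iterator is in */
    Entry *eiter;        /* entry the iterator is at */
} Hash;

unsigned h_bucket(const char *key, size_t klen)
{
    unsigned h = 0;
    while (klen--)
        h = h * 33 + (unsigned char)*key++;
    return h % NBUCKETS;
}

/* Return the entry for key, or NULL if it is not present. */
Entry *h_find(Hash *h, const char *key, size_t klen)
{
    Entry *e;
    assert(h != NULL);
    for (e = h->buckets[h_bucket(key, klen)]; e; e = e->next)
        if (e->klen == klen && memcmp(e->key, key, klen) == 0)
            return e;
    return NULL;
}

/* Store val under key, replacing any existing value. */
void h_store(Hash *h, const char *key, size_t klen, void *val)
{
    Entry *e = h_find(h, key, klen);
    if (e == NULL) {
        unsigned b = h_bucket(key, klen);
        e = malloc(sizeof *e);
        assert(e != NULL);
        e->key = malloc(klen ? klen : 1);
        assert(e->key != NULL);
        memcpy(e->key, key, klen);
        e->klen = klen;
        e->next = h->buckets[b];
        h->buckets[b] = e;
    }
    e->val = val;
}

/* Reset the iterator to just before the first entry. */
void h_iterinit(Hash *h)
{
    h->riter = -1;
    h->eiter = NULL;
}

/* Advance to the next entry; returns NULL when the table is exhausted. */
Entry *h_iternext(Hash *h)
{
    Entry *e = h->eiter ? h->eiter->next : NULL;
    while (e == NULL && ++h->riter < NBUCKETS)
        e = h->buckets[h->riter];
    h->eiter = e;
    return e;
}
```

A caller resets the iterator once with h_iterinit() and then calls h_iternext() until it returns NULL, reading the key, klen, and val fields of each entry along the way. The order in which entries come back depends on the hash function, just as the order returned by each in Perl is not the insertion order. The sketch omits freeing the entries.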
Values in Perl exist until explicitly freed. They are freed by the Perl garbage collector when the reference count to them is zero, by a call to the undef function, or if they were declared local or my and the scope no longer exists. In all other cases, variables declared in one scope persist even after execution has left the code block in which they were declared. For example, declaring and using $a in a function keeps $a in the main program even after returning from the subroutine. This is why it's necessary to create local variables in subroutines using the my keyword so that the Perl interpreter will automatically destroy these variables, which will no longer be used after the subroutine returns.
Reference counts for variables in Perl can also be examined and modified using the following functions:
int SvREFCNT(SV* sv);
This function returns the current reference count to an existing SV.
void SvREFCNT_inc(SV* sv);
This function increments the current reference count.
void SvREFCNT_dec(SV* sv);
This function decrements the current reference count. You can bring the reference count down to zero to delete the SV being pointed to and let the garbage handler get rid of it.
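Reference counting itself is simple enough to model in a few lines of plain C. The sketch below is not Perl's implementation: Obj, live_objects, and the obj_ functions are hypothetical names used only to show a count that starts at one, goes up when someone else keeps a pointer, and frees the value when it reaches zero.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted value; not Perl's real SV. */
typedef struct {
    int refcnt;  /* number of owners */
    int value;   /* payload */
} Obj;

int live_objects = 0;  /* demo counter: objects allocated and not yet freed */

/* Create a value owned by the caller (count starts at 1). */
Obj *obj_new(int value)
{
    Obj *o = malloc(sizeof *o);
    assert(o != NULL);
    o->refcnt = 1;
    o->value = value;
    live_objects++;
    return o;
}

/* Record one more owner. */
Obj *obj_refcnt_inc(Obj *o)
{
    assert(o != NULL);
    o->refcnt++;
    return o;
}

/* Drop one owner; free the value when nobody owns it any more. */
void obj_refcnt_dec(Obj *o)
{
    assert(o != NULL && o->refcnt > 0);
    if (--o->refcnt == 0) {
        live_objects--;
        free(o);
    }
}
```

After obj_new() followed by one obj_refcnt_inc(), two calls to obj_refcnt_dec() are needed before the memory is released. Decrementing more often than you incremented is the kind of mistake that leads to the unpredictable results mentioned in this chapter.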
Because the values declared within code blocks persist for a long time, they are referred to as immortal. Sometimes variables declared and created in code blocks persist even if you do not want them to. When writing code that declares and creates such variables, it's a good idea to create the ones you do not want to persist as mortal; that is, they die when execution leaves the current scope.
The functions that create a mortal variable are as follows:
SV* sv_newmortal(); | This function creates a new mortal variable and returns a pointer to it. |
SV* sv_2mortal(SV*); | This function converts an existing immortal SV into a mortal variable. Be careful not to convert an already mortal SV into a mortal SV because this operation may result in the reference count for the variable being decremented twice, leading to unpredictable results. |
SV* sv_mortalcopy(SV*); | This function copies an existing SV (without regard to the mortality of the passed SV) into a new mortal SV. |
To create AV and HV types, you have to cast the input parameters to and from these three functions as AV* and HV*.
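One way to think about mortality is as a list of "decrement this later" requests that is processed when the current scope ends. The standalone sketch below works under that assumption and is not Perl's implementation: Obj, the tmps array, obj_2mortal(), and free_tmps() are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted value; not Perl's real SV. */
typedef struct {
    int refcnt;
} Obj;

int live_objects = 0;      /* demo counter of objects not yet freed */

#define MAX_TMPS 16
Obj *tmps[MAX_TMPS];       /* values waiting for a delayed decrement */
int  tmps_top = 0;         /* number of entries in tmps */

Obj *obj_new(void)
{
    Obj *o = malloc(sizeof *o);
    assert(o != NULL);
    o->refcnt = 1;
    live_objects++;
    return o;
}

void obj_dec(Obj *o)
{
    assert(o != NULL && o->refcnt > 0);
    if (--o->refcnt == 0) {
        live_objects--;
        free(o);
    }
}

/* Make o mortal: schedule exactly one decrement for scope exit. */
Obj *obj_2mortal(Obj *o)
{
    assert(o != NULL && tmps_top < MAX_TMPS);
    tmps[tmps_top++] = o;
    return o;
}

/* Leave a scope: apply every decrement scheduled since mark. */
void free_tmps(int mark)
{
    while (tmps_top > mark)
        obj_dec(tmps[--tmps_top]);
}
```

A scope records tmps_top on entry and passes it to free_tmps() on exit. A mortal value that nobody else has taken a reference to is freed at that point, while one whose count was incremented in the meantime survives. The model also makes the double-mortal hazard visible: registering the same value twice schedules two decrements for a single reference.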
Perl subroutines use the stack to get and return values to the callers. Chapter 27, "Writing Extensions in C," covers how the stack is manipulated. This section describes the functions available for you to manipulate the stack.
Note |
Look in the XSUB.h file for more details than this chapter can give you. The details in the header include macro definitions for manipulating the stack in an extension module. |
Arguments passed on the stack to a Perl function are available via the ST(n) macro set. The first item on the stack is ST(0), and the mth one is ST(m-1). You may assign an argument to a local variable, like this:
SV *arg1 = ST(1); // Assign argument[1] to arg1;
You can even increase the size of the argument stack in a function. (This is necessary if you are returning a list from a function call, for example. I cover this in more detail in Chapter 27.) To increase the length of the stack, make a call to the macro:
EXTEND(sp, num);
sp is the stack pointer and num is an extra number of elements to add to the stack. You cannot decrease the size of the stack.
To add items to the stack, you have to specify the type of variable you're adding. Four macros are available for four of the most generic types to push:
PUSHi(IV);
PUSHn(double);
PUSHp(char*, I32);
PUSHs(SV*);
If you want the stack to be adjusted automatically, make the calls to these macros:
XPUSHi(IV);
XPUSHn(double);
XPUSHp(char*, I32);
XPUSHs(SV*);
These macros are a bit slower but simpler to use.
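The difference between extending the stack yourself and letting each push adjust it can be modeled in plain C. The sketch below is not Perl's stack code: ArgStack and the stack_ functions are hypothetical, with stack_extend() standing in for EXTEND, stack_push() for a push that assumes room already exists, stack_xpush() for a push that makes room first, and stack_st() for ST(n).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of an argument stack; not Perl's real stack. */
typedef struct {
    void **base;  /* bottom of the allocated stack */
    int    sp;    /* index of the top item, -1 when empty */
    int    max;   /* number of allocated slots */
} ArgStack;

/* Make sure there is room for num more items; never shrinks. */
void stack_extend(ArgStack *s, int num)
{
    assert(s != NULL && num >= 0);
    if (s->sp + 1 + num > s->max) {
        int newmax = s->sp + 1 + num;
        void **p = realloc(s->base, (size_t)newmax * sizeof *p);
        assert(p != NULL);
        s->base = p;
        s->max = newmax;
    }
}

/* Unchecked push: the caller must already have extended the stack. */
void stack_push(ArgStack *s, void *item)
{
    assert(s != NULL && s->sp + 1 < s->max);
    s->base[++s->sp] = item;
}

/* Checked push: makes room for one item first, so it costs a bit more. */
void stack_xpush(ArgStack *s, void *item)
{
    stack_extend(s, 1);
    stack_push(s, item);
}

/* The n-th item counted from the first argument, which sits at index ax. */
void *stack_st(const ArgStack *s, int ax, int n)
{
    assert(s != NULL && ax + n <= s->sp);
    return s->base[ax + n];
}
```

When you know in advance how many values you will return, a single stack_extend() followed by unchecked pushes avoids repeating the size check. The checked push trades that small cost for not having to count.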
In Chapter 27, you'll see how to use stacks in the section titled "The typemap File." Basically, a typemap file is used by the extensions compiler xsubpp for the rules to convert from Perl's internal data types (hash, array, and so on) to C's data types (int, char *, and so on). These rules are stored in the typemap file in your Perl distribution's ./lib/ExtUtils directory.
The definitions of the structures in the typemap file are specified in the internal format for Perl.
As you go through the online docs and the source for Perl, you'll often see the word magic. The mysterious connotations of this word are further enhanced by the almost complete lack of documentation on what magic really is. In order to understand the phrases "then magic is applied to whatever" or "automagically [sic]" in the Perl documentation, you have to know what "magic" in Perl really means. Perhaps after reading this section, you will have a better feel for Perl internal structures and actions of the Perl interpreter.
Basically, a scalar value in Perl can have special features that make it "magical." When you apply magic to a variable, a linked list of methods is attached to that variable. A method is called for each type of magic assigned to that variable when a certain action takes place, such as retrieving or storing the contents of the variable. Compare this with the scheme Perl uses for the tie() function, as described in Chapter 6, "Binding Variables to Objects." When tie-ing a variable to an action, you are defining actions to take when a scalar is accessed or when an array item is read from. In the case of magic actions of a scalar, you have a set of magic methods that are called when the Perl interpreter takes a similar action on (like getting a value from or putting a value into) a scalar variable.
To check whether a variable has magic methods associated with it, you can get the flags for it using the SvFLAGS(sv) macro. The (sv) here is the name of the variable. The SvMAGIC(variable) macro returns the list of methods that are magically applied to the variable. The SvTYPE() of the variable is SVt_PVMG if it has a list of methods. A normal SV type is upgraded to a magical status by Perl if a method is requested for it.
The structure to maintain the list is found in the file mg.h in the Perl source files:
struct magic {
    MAGIC*  mg_moremagic;  // pointer to next method. NULL if none.
    MGVTBL* mg_virtual;    // pointer to table of methods.
    U16     mg_private;    // internal variable
    char    mg_type;       // type of methods
    U8      mg_flags;      // flags for this method
    SV*     mg_obj;        // Reference to itself
    char*   mg_ptr;        // name of the magic variable
    I32     mg_len;        // length of the name
};
The mg_type value sets up how the magic function is applied. The following items are used in the magic table. You can see the values in use in the sv.c file at about line 1950. Table 25.2, which has been constructed from the switch statement, tells you how methods have to be applied.
Virtual Magic Table | Action Calling Method |
vtbl_sv | Null operation |
vtbl_amagic | Operator overloading |
vtbl_amagicelem | Operator overloading |
0 | Used in operator overloading |
vtbl_bm | Unknown |
vtbl_env | %ENV hash |
vtbl_envelem | %ENV hash element |
vtbl_mglob | Regexp applied globally |
vtbl_isa | @ISA array |
vtbl_isaelem | @ISA array element |
0 | Unknown |
vtbl_dbline | Debugger line |
vtbl_pack | Tied array or hash |
vtbl_packelem | Tied array or hash element |
vtbl_packelem | Tied scalar or handle |
vtbl_sig | Signal hash |
vtbl_sigelem | Signal hash element |
vtbl_taint | Modified tainted variable |
vtbl_uvar | Unknown variable type |
vtbl_vec | Vector |
vtbl_substr | Substring |
vtbl_glob | The GV type |
vtbl_arylen | Array length |
vtbl_pos | $. scalar variable |
The magic virtual tables are defined in embed.h. The mg_virtual field in each magic entry is assigned to the address of the virtual table.
Each entry in the magic virtual table has five items, each of which is defined in the following structure in the file mg.h:
struct mgvtbl {
    int (*svt_get)   _((SV *sv, MAGIC* mg));
    int (*svt_set)   _((SV *sv, MAGIC* mg));
    U32 (*svt_len)   _((SV *sv, MAGIC* mg));
    int (*svt_clear) _((SV *sv, MAGIC* mg));
    int (*svt_free)  _((SV *sv, MAGIC* mg));
};
The svt_get() function is called when the data in SV is retrieved. The svt_set() function is called when the data in SV is stored. The svt_len() function is called when the length of the SV is requested. The svt_clear() function is called when SV is cleared, and the svt_free() function is called when SV is destroyed.
All tables shown in the perl.h file are assigned mgvtbl structures. The values in each mgvtbl structure for each item in a table define a function to call when an action that affects entries in this table is taken by the Perl interpreter. Here is an excerpt from the file:
EXT MGVTBL vtbl_sv =
{magic_get, magic_set, magic_len,0,0};
EXT MGVTBL vtbl_env =
{0, 0, 0, 0, 0};
EXT MGVTBL vtbl_envelem =
{0, magic_setenv, 0,magic_clearenv, 0};
EXT MGVTBL vtbl_sig =
{0, 0, 0, 0, 0};
EXT MGVTBL vtbl_sigelem =
{0, magic_setsig, 0,0, 0};
The vtbl_sv table is set to call three methods: magic_get(), magic_set(), and magic_len() for the magic entries in sv. The zeros for vtbl_sig indicate that no magic methods are called.
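The way a linked list of magic entries and a table of function pointers work together can be shown with a small standalone model. Everything in the sketch is hypothetical and much simpler than mg.h: the Scalar, Magic, and MagicVtbl types, the two demo tables, and the scalar_get() and scalar_set() functions are assumptions made for illustration. A zero (NULL) slot in a table means that no method is called for that action, as in the excerpt above.

```c
#include <assert.h>
#include <stddef.h>

struct scalar;
struct magic;

/* Table of hooks; a NULL slot means "no method for this action". */
typedef struct {
    int (*get)(struct scalar *sv, struct magic *mg);  /* on retrieval */
    int (*set)(struct scalar *sv, struct magic *mg);  /* on storing */
} MagicVtbl;

typedef struct magic {
    struct magic    *more;   /* next entry in the list, NULL if none */
    const MagicVtbl *virt;   /* table of methods for this entry */
    int              calls;  /* demo field: how often get was invoked */
} Magic;

typedef struct scalar {
    int    value;
    Magic *magic;            /* head of the magic list, NULL if plain */
} Scalar;

/* Demo get method: count how many times the value is read. */
int count_get(Scalar *sv, Magic *mg)
{
    (void)sv;
    mg->calls++;
    return 0;
}

/* Demo set method: never let a negative value be stored. */
int clamp_set(Scalar *sv, Magic *mg)
{
    (void)mg;
    if (sv->value < 0)
        sv->value = 0;
    return 0;
}

const MagicVtbl demo_vtbl_counter = { count_get, NULL };
const MagicVtbl demo_vtbl_clamp   = { NULL, clamp_set };

/* Read the value, running every get method on the list first. */
int scalar_get(Scalar *sv)
{
    Magic *mg;
    assert(sv != NULL);
    for (mg = sv->magic; mg != NULL; mg = mg->more)
        if (mg->virt != NULL && mg->virt->get != NULL)
            mg->virt->get(sv, mg);
    return sv->value;
}

/* Store the value, then run every set method on the list. */
void scalar_set(Scalar *sv, int v)
{
    Magic *mg;
    assert(sv != NULL);
    sv->value = v;
    for (mg = sv->magic; mg != NULL; mg = mg->more)
        if (mg->virt != NULL && mg->virt->set != NULL)
            mg->virt->set(sv, mg);
}
```

Attaching both demo entries to one Scalar gives it two kinds of magic at once: stores are clamped by the second entry and reads are counted by the first, while code that calls scalar_get() and scalar_set() needs no knowledge of either.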
If you are still awake, you'll notice a reference to GV in the source file. GV stands for glob value, and the value stored in a GV can be any data type from a scalar to a subroutine reference. GV entries are stored in a hash table, and the keys to each entry are the names of the symbols being stored. A hash table with GV entries is also referred to as a stash. Internally, a stash is the same as an HV type.
Keys in a stash are also package names, with the data item pointing to other GV tables containing the symbol within the package.
Most of the information in this chapter has been gleaned from source files or the online documents on the Internet. There is a somewhat old file called perlguts.html by Jeff Okamoto (e-mail okamoto@corp.hp.com) in the www.metronet.com archives that has the Perl API functions and information about the internals.
Note that the perlguts.html file was dated 1/27/1995, so it's probably not as up-to-date as you would like.
Please refer to the perlguts.html or the perlguts man page for a comprehensive listing of the Perl API. If you want a listing of the functions in the Perl source code and the search strings, use the ctags *.c command on all the .c files in the Perl source directory. The result will be a very long file (800 lines), called tags, in the same directory. The beginning of this file is shown here:
ELSIF perly.c /^"else : ELSIF '(' expr ')' block else",$/
GvAVn gv.c /^AV *GvAVn(gv)$/
GvHVn gv.c /^HV *GvHVn(gv)$/
Gv_AMupdate gv.c /^Gv_AMupdate(stash)$/
HTOV util.c /^HTOV(htovs,short)$/
PP pp.c /^PP(pp_abs)$/
PP pp.c /^PP(pp_anoncode)$/
PP pp.c /^PP(pp_anonhash)$/
PP pp.c /^PP(pp_anonlist)$/
PP pp.c /^PP(pp_aslice)$/
If you are a vi hack, you can type :tag functionName to go to the line and file immediately from within a vi session. Ah, the old vi editor still has a useful function in this day and age. emacs users can issue the command etags *.c and get a comparable tags file for use with the M-x find-tag command in emacs.
This chapter is a reference-only chapter to prepare you for what lies ahead in the rest of the book. You'll probably be referring to this chapter quite a bit as you write extensions.
There are three basic types of variables in Perl: SV for scalar, AV for arrays, and HV for hash values. Macros exist for getting data from one type to another. You'll need to know about these internal data types if you're going to be writing Perl extensions, dealing with platform-dependent issues, or (ugh) embedding C code in Perl and vice versa. The dry information in this chapter will serve you well in the rest of this book.